SSH Key Rotation at Enterprise Scale: A Working Playbook
DataDike Security Research
PAM Research & Field Engineering
SSH keys are the most quietly accumulated privileged credential in any infrastructure. Every developer who has ever opened a Linux jump host adds one. Every CI/CD bootstrap script that wanted "easy" access leaves one behind. Every contractor who briefly held production access takes one with them when they leave. The result, ten years into the estate's life, is an authorized_keys ecology that nobody fully maps and nobody confidently rotates.
This article walks through a rotation program that has worked across Linux estates of 500 to 50,000 hosts. The framing is operational: inventory, rotation, exception handling, and the small set of metrics that tell you whether the program is still healthy.
Inventory first — and not the inventory you think you have
Run a sweep across the estate that does three things, all of which must be present for the inventory to be trustworthy: enumerate every authorized_keys file on every host, fingerprint each key, and correlate the fingerprint to a human owner if one exists. The first two are mechanical; the third is the work.
#!/usr/bin/env bash
# Run as root via your config-management tool (Ansible/Puppet/Chef/Salt).
# Inventories authorized_keys for every account with a home under /home, plus root.
for user_home in /home/* /root; do
    keys_file="${user_home}/.ssh/authorized_keys"
    [ -f "${keys_file}" ] || continue
    # The "|| [ -n ... ]" clause keeps a final line that lacks a trailing newline.
    while read -r key_line || [ -n "${key_line}" ]; do
        [ -z "${key_line}" ] && continue         # skip blank lines
        [[ "${key_line}" =~ ^# ]] && continue    # skip comment lines
        # Column 2 of `ssh-keygen -l` output is the fingerprint.
        fingerprint=$(printf '%s\n' "${key_line}" | ssh-keygen -lf /dev/stdin 2>/dev/null | awk '{print $2}')
        # CSV: host,home,fingerprint,key line (the key line is last because it may contain commas).
        echo "$(hostname),${user_home},${fingerprint},${key_line}"
    done < "${keys_file}"
done
Feed the output into a key-fingerprint table that joins against your IDP. The unjoinable rows, meaning keys whose fingerprint is not associated with any current employee, are the first finding. In the engagements we run, this finding is consistently 50–70% of the inventory. Those keys are not necessarily malicious. They belong to former contractors, decommissioned automation, and lost laptops, plus the one very old emergency-access key from 2018 that everyone forgot about.
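The join against the IDP can be sketched with awk. This is a minimal sketch, assuming the sweep output has been collected into one CSV (host, home, fingerprint, key line) and that the IDP can export a two-column CSV of fingerprint and owner. Both file layouts and the function name unjoined_keys are illustrative assumptions, not a fixed format.

```bash
#!/usr/bin/env bash
# Print inventory rows whose fingerprint has no owner in the IDP export.
# Assumed (hypothetical) inputs:
#   $1  inventory CSV: host,home,fingerprint,key_line  (collected sweep output)
#   $2  IDP export CSV: fingerprint,owner
unjoined_keys() {
    local inventory="$1" idp_export="$2"
    awk -F',' '
        # First file (the IDP export): remember every fingerprint that has an owner.
        FILENAME == ARGV[1] { owned[$1] = 1; next }
        # Second file (the inventory): print rows whose fingerprint is unknown.
        !($3 in owned)
    ' "$idp_export" "$inventory"
}
```

Calling `unjoined_keys inventory.csv idp_keys.csv > orphans.csv` would produce the list of unjoinable rows; comparing the line counts of the two files gives the unjoined share of the inventory.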
Cohort the rotation, do not boil the ocean
A single mass rotation across a 10,000-host estate fails in two ways: the blast radius of a mistake is the entire estate, and the dependencies you discover (a backup job that uses an old key, a monitoring agent with a hardcoded thumbprint) take down production simultaneously. Cohort by blast radius instead.
- Cohort 1: non-production hosts and lab environments. Rotate aggressively. Anything that breaks here is a finding, not an outage. Two weeks.
- Cohort 2: production hosts with low business criticality (internal tools, dev support infrastructure). Rotate with a 24-hour rollback window. Three weeks.
- Cohort 3: production hosts with mid-criticality (analytics, internal apps). Same approach with paired observability. Four weeks.
- Cohort 4: tier-zero production (customer-facing, payment, identity, data-platform). Rotate by named owner, with a runbook per service, a paired SRE, and a change window. Eight weeks.
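The cohort assignment above can be expressed as a small lookup. This is a sketch only: the environment labels (nonprod, lab, prod) and criticality labels (low, mid, tier0) are hypothetical tags that your CMDB or inventory would need to supply, and sending anything unrecognized to Cohort 4 is our assumption, chosen so that an unlabeled host gets the most cautious treatment.

```bash
#!/usr/bin/env bash
# Map a host's environment and criticality to a rotation cohort (1-4).
# The label values are illustrative assumptions, not a standard taxonomy.
cohort_for() {
    local env="$1" criticality="$2"
    case "${env}:${criticality}" in
        nonprod:*|lab:*) echo 1 ;;   # non-production and lab: rotate aggressively
        prod:low)        echo 2 ;;   # low-criticality production
        prod:mid)        echo 3 ;;   # mid-criticality production
        *)               echo 4 ;;   # tier-zero, or anything unlabeled
    esac
}
```

Given a hypothetical hosts.csv with host, environment, and criticality columns, a loop such as `while IFS=, read -r host env crit; do echo "${host},$(cohort_for "$env" "$crit")"; done < hosts.csv` would tag every host with its cohort.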
Discovery beats theory
In our experience, Cohort 1 surfaces roughly 80% of the hidden dependencies in the estate. By the time you reach Cohort 4, the remaining surprises are knowable and the operators have built muscle memory.
The four mistakes that delay programs by quarters
- Rotating keys without first vaulting them. The new key becomes the next orphan. Vault first, rotate second, log everything to one stream.
- Allowing exceptions without expiry. "We will rotate that later" becomes never. Every exception gets a date and an owner; the date is enforced in the same dashboard the rotation runs out of.
- Letting CI/CD opt out indefinitely. Pipeline keys are the highest-risk concentrated credential in modern engineering organizations. They should get short-lived credentials issued by a workload-identity broker, not a long-lived private key parked in a secrets manager.
- Skipping the ownership reassignment. A key with no owner becomes a stale key the moment the team reorganizes. Ownership reassignment is a quarterly task with the same cadence as access review.
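The exception rule in the list above (every exception gets a date and an owner, and the date is enforced) can be checked mechanically. The sketch below assumes a hypothetical exceptions CSV with fingerprint, owner, and an ISO 8601 expiry date (YYYY-MM-DD); the file layout and the function name are illustrative.

```bash
#!/usr/bin/env bash
# Flag rotation exceptions that are past their date or missing an owner/date.
# Assumed (hypothetical) input, $1: CSV of fingerprint,owner,expires (YYYY-MM-DD).
# Optional $2: the date to compare against; defaults to today.
overdue_exceptions() {
    local exceptions_csv="$1" today="${2:-$(date +%F)}"
    awk -F',' -v today="$today" '
        # An exception without an owner or an expiry date is invalid by policy.
        $2 == "" || $3 == ""  { print "INVALID," $0; next }
        # ISO dates sort lexicographically; appending "" forces string comparison.
        ($3 "") < (today "")  { print "EXPIRED," $0 }
    ' "$exceptions_csv"
}
```

Running this on a schedule and treating any output as a failed check is one way to keep "we will rotate that later" from becoming never.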
The cadence that survives
Once the estate is clean, we recommend the following steady-state cadence: human user keys rotate every 90 days; service-account keys rotate every 30 days; CI/CD keys are replaced with short-lived workload-identity tokens issued per build, which need no rotation because there is no long-lived key. The 90-day human cadence mirrors the 90-day password-change interval in PCI DSS, which does not itself prescribe an SSH-key rotation period; the 30-day service-account cadence reflects how often something else in the environment surfaces a stale dependency that needs fixing before an attacker finds it.
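The cadence can be turned into a simple due-date check. This sketch assumes the key's age in days is available from wherever issuance is recorded (for example, the vault), because authorized_keys itself carries no creation date. The key-type labels and the function name are hypothetical.

```bash
#!/usr/bin/env bash
# Return 0 if a key is due for rotation, 1 if not, 2 for an unknown key type.
# $1: key type (human | service | cicd), $2: key age in whole days.
rotation_due() {
    local key_type="$1" age_days="$2" max_age
    case "$key_type" in
        human)   max_age=90 ;;   # human user keys: every 90 days
        service) max_age=30 ;;   # service-account keys: every 30 days
        cicd)    max_age=0 ;;    # static CI/CD keys should not exist, so always due
        *)       return 2 ;;
    esac
    [ "$age_days" -ge "$max_age" ]
}
```

For example, `rotation_due service 45 && echo "rotate now"` would report a 45-day-old service-account key as overdue.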
How DataDike handles SSH key rotation
DataDike provides agentless SSH-key inventory across the estate over SSH itself, with no software on the host, and continuously produces the same fingerprint table the script above generates, joined to the IDP. Rotation is a workflow object: pick a cohort and a target cadence, and the platform handles the orchestration, with rollback windows, paired observability, and an audit log built for external review.
For CI/CD specifically, DataDike issues short-lived SSH certificates signed by an internal CA, scoped to the pipeline workload identity. The pipeline gets just-in-time SSH access without a static private key anywhere in the chain. That single shift retires the largest category of hardcoded credential most engineering organizations carry.
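To illustrate the general mechanism (this is plain OpenSSH, not DataDike's implementation), a short-lived user certificate can be signed with ssh-keygen. The sketch below assumes an unencrypted CA private key on disk and a per-build public key; the function name, argument order, and the 15-minute default are illustrative choices. Target hosts would also need to trust the CA, for example through the TrustedUserCAKeys option in sshd_config.

```bash
#!/usr/bin/env bash
# Sign a per-build public key with an internal CA, producing a short-lived cert.
# $1: CA private key, $2: public key to sign, $3: workload identity (key ID),
# $4: principal (login name) the cert is valid for, $5: lifetime (default 15m).
issue_pipeline_cert() {
    local ca_key="$1" pubkey="$2" workload_id="$3" principal="$4" ttl="${5:-15m}"
    # -s: CA key, -I: key identity recorded in logs, -n: allowed principals,
    # -V: validity interval ("+15m" means from now until 15 minutes from now).
    # For build.pub the certificate is written alongside it as build-cert.pub.
    ssh-keygen -q -s "$ca_key" -I "$workload_id" -n "$principal" -V "+${ttl}" "$pubkey"
}
```

Because the certificate expires on its own, there is nothing left to rotate once the build finishes; `ssh-keygen -L -f build-cert.pub` shows the key ID, principals, and validity window for inspection.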