Shell scripts often start life as a quick fix: a few commands copied from a ticket, pasted into a terminal, then saved as fix.sh and scheduled in cron. Weeks later, that “temporary” script is now a critical part of patching, backups, log rotation, user provisioning, or environment bootstrap. At that point, the difference between “works on my machine” and “operates reliably on 2,000 servers” is almost always your engineering discipline.
This guide is about Linux shell scripting best practices you can apply immediately as an IT administrator or system engineer. It treats scripts as production artifacts: designed for repeatable execution, safe failure modes, auditability, and long-term maintenance. The focus is on Bash and POSIX sh concepts that matter in real estates (mixed distros, cron/systemd, SSH fan-out, containers, and minimal images).
Throughout, you’ll see patterns that compose well together: safe defaults, clear interfaces, explicit dependencies, predictable logging, defensive parsing, and simple test hooks. The goal isn’t “clever shell”; it’s automation you can trust at 03:00.
Decide what you’re writing: POSIX sh vs Bash (and document it)
One of the first reliability decisions is the interpreter. A large percentage of scripting bugs in enterprise Linux comes from assuming Bash features while running under /bin/sh, especially on distributions where /bin/sh is dash (common on Debian/Ubuntu). Arrays, [[ ... ]], brace expansion, pipefail, and declare are frequent culprits.
If you need Bash features, commit to Bash and declare it explicitly in the shebang:
#!/usr/bin/env bash
Using /usr/bin/env helps when Bash is not at /bin/bash, which can matter in some container images or non-standard environments. If your script must run in minimal POSIX environments (initramfs, busybox, some embedded systems), then target POSIX sh and avoid Bash-only syntax.
Whatever you choose, document it at the top of the file in a comment and keep the script consistent. A small upfront choice here prevents subtle runtime failures when a script is run by cron (which may default to sh) or invoked by another tool.
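A short header comment along these lines makes the contract explicit (the fields and platforms shown are just one team convention, not a standard):
bash
#!/usr/bin/env bash
# Interpreter: Bash 4+ (uses arrays, [[ ]], pipefail); not POSIX sh compatible.
# Tested on: RHEL 8/9, Ubuntu 22.04. Typical invocation: systemd timer or manual run.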
Start with a “production header”: strictness, safety, and predictable environment
Shell is permissive by default. It will happily keep going after a command fails, expand unset variables to empty strings, and split values on whitespace in ways that can rewrite your intent. In production automation, permissiveness becomes risk.
For Bash scripts, a common baseline is “strict mode”:
bash
#!/usr/bin/env bash
set -Eeuo pipefail
# Safer word splitting
IFS=$'\n\t'
# Optional: predictable glob behavior
shopt -s failglob
Here’s what each part buys you:
set -e exits on an unhandled non-zero exit status. This prevents “half-success” runs that silently skip failed steps.
set -u treats unset variables as errors. This forces you to initialize inputs and reduces the chance that empty strings become paths like / or commands operate on unintended targets.
set -o pipefail makes pipelines fail if any component fails, not just the last one. Without it, cmd1 | cmd2 can mask a failure in cmd1.
set -E ensures ERR traps are inherited by functions and subshells (with caveats), which matters once you add centralized error handling.
IFS=$'\n\t' reduces unexpected word splitting on spaces. This doesn’t “fix filenames with spaces” on its own, but it removes a major foot-gun when iterating over lines.
shopt -s failglob (Bash) turns unmatched globs into errors instead of passing the literal pattern through. Without it, an unmatched /path/*.log is handed to the command as the literal string /path/*.log, so rm /path/*.log can fail with a confusing error or, in other commands, behave unexpectedly.
If you are targeting POSIX sh, you cannot use pipefail or shopt. In that case, prioritize explicit exit-code checks and careful quoting, and be disciplined about error propagation.
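A sketch of that discipline in POSIX sh checks each stage explicitly instead of relying on pipefail (the pg_dumpall command and the paths are illustrative):
bash
#!/bin/sh
set -eu   # both are POSIX; pipefail is not portable

tmp=$(mktemp) || exit 1

# Check each stage of the would-be pipeline separately.
if ! pg_dumpall > "$tmp"; then
    echo "ERROR: pg_dumpall failed" >&2
    rm -f -- "$tmp"
    exit 1
fi
if ! gzip -c -- "$tmp" > /var/backups/db.sql.gz; then
    echo "ERROR: gzip failed" >&2
    rm -f -- "$tmp"
    exit 1
fi
rm -f -- "$tmp"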
Minimize inherited state: locale, PATH, umask
Even with strict mode, scripts can behave differently depending on inherited environment. For automation, it’s better to pin down what affects parsing, discovery of binaries, and file permissions.
Locale differences can change sort order, character classes, and even decimal parsing in some tools. If you parse command output or rely on deterministic ordering, set LC_ALL to a known value:
bash
export LC_ALL=C
For PATH, do not assume interactive shell values. Cron and systemd often use a minimal PATH, and a script that works interactively can fail in scheduled runs. Either set PATH explicitly or call tools with absolute paths when appropriate:
bash
export PATH=/usr/sbin:/usr/bin:/sbin:/bin
For permissions, set a reasonable umask if you create files that should not be world-readable by default:
bash
umask 027
These are small lines, but they reduce “works in my terminal, fails in automation” incidents.
Design your script as a tool: interface, help, and exit codes
Most operational scripts end up being run by humans and schedulers. A good interface reduces mistakes and enables safe automation.
Start with consistent conventions:
- Use flags (--dry-run, --force, --host, --timeout) rather than positional arguments when the meaning is non-trivial.
- Provide -h/--help that documents intent, required permissions, and examples.
- Use meaningful exit codes: 0 for success, non-zero for failure. Consider reserving different codes for different error classes when it helps automation.
A minimal, maintainable argument parser for Bash can be built with a while loop and case. It’s not as feature-rich as getopt, but it’s predictable and behaves the same on every distro:
bash
usage() {
cat <<'EOF'
Usage: rotate-app-logs.sh [--dry-run] --dir DIR --keep N
Rotates and compresses application logs in DIR, keeping N archives.
Options:
--dir DIR Log directory to manage (required)
--keep N Number of rotated archives to keep (required)
--dry-run Print actions without modifying files
-h, --help Show this help
EOF
}
DRY_RUN=0
LOG_DIR=
KEEP=
while [[ $# -gt 0 ]]; do
case "$1" in
--dir) LOG_DIR=${2-}; shift 2 ;;
--keep) KEEP=${2-}; shift 2 ;;
--dry-run) DRY_RUN=1; shift ;;
-h|--help) usage; exit 0 ;;
--) shift; break ;;
*) echo "Unknown argument: $1" >&2; usage; exit 2 ;;
esac
done
[[ -n ${LOG_DIR} ]] || { echo "--dir is required" >&2; exit 2; }
[[ -n ${KEEP} ]] || { echo "--keep is required" >&2; exit 2; }
This style is verbose by design: it fails fast, avoids implicit shifts, and makes required inputs explicit. That’s a recurring theme in Linux shell scripting best practices—choose clarity over cleverness.
Structure: functions, a main entrypoint, and readable flow
A shell script that grows beyond ~30–50 lines benefits from explicit structure. Without it, small changes become risky because control flow is implicit and variables leak across the entire file.
A pragmatic pattern is:
- top: environment and safety (set -Eeuo pipefail, IFS, PATH)
- then: constants and defaults
- then: functions (small, single-purpose)
- then: main() that orchestrates
- bottom: main "$@"
Example skeleton:
bash
#!/usr/bin/env bash
set -Eeuo pipefail
IFS=$'\n\t'
export LC_ALL=C
export PATH=/usr/sbin:/usr/bin:/sbin:/bin
umask 027
log() { printf '%s %s\n' "$(date -Is)" "$*" >&2; }
die() { log "ERROR: $*"; exit 1; }
main() {
parse_args "$@"
preflight
run
}
main "$@"
This doesn’t add complexity; it creates safe seams. Those seams matter when you later add features like --dry-run, more inputs, or different execution modes.
Quoting and word splitting: treat data as data
If there is one category of bugs that repeatedly causes outages or data loss in shell scripts, it’s unintended word splitting and glob expansion.
In shell, unquoted variables undergo word splitting and pathname expansion:
bash
rm $file
# dangerous if $file contains spaces or globs
Safer is almost always:
bash
rm -- "$file"
The -- tells many commands “end of options,” preventing a filename beginning with - from being interpreted as a flag.
Quoting is not just about spaces. It’s also about values like *, ?, [ which have glob meaning. If a script takes user input (even from a trusted operator), unquoted expansion can turn input into patterns.
When you intentionally want splitting (for example, iterating fields), make it explicit and local. Avoid relying on global IFS changes mid-script because it creates non-obvious coupling.
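One way to keep it local is to change IFS only for the read that needs it, as in this sketch that walks /etc/passwd fields:
bash
# IFS is modified only for this read; the rest of the script is unaffected.
while IFS=: read -r user _ uid _ _ home shell; do
    printf '%s uid=%s home=%s shell=%s\n' "$user" "$uid" "$home" "$shell"
done < /etc/passwd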
Prefer arrays in Bash for argument lists
In Bash, arrays are the cleanest way to build command lines without re-parsing strings. This is especially important when arguments can contain spaces.
bash
args=(--archive --verbose)
[[ $DRY_RUN -eq 1 ]] && args+=(--dry-run)
rsync "${args[@]}" -- "$src/" "$dst/"
Avoid constructing a string like args="--archive --verbose" and then doing rsync $args ...; that reintroduces splitting bugs.
If you must support POSIX sh, you don’t have arrays. In that case, keep argument sets simple and avoid complex dynamic assembly.
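When you do need a short dynamic list in POSIX sh, the positional parameters can stand in for an array; a sketch, assuming src and dst were validated earlier:
bash
#!/bin/sh
# "$@" is the only safe list structure POSIX sh provides.
set -- --archive --verbose
if [ "${DRY_RUN:-0}" -eq 1 ]; then
    set -- "$@" --dry-run
fi
rsync "$@" -- "$src/" "$dst/"

The trade-off is that set -- overwrites the script’s own positional parameters, so do this only after argument parsing or inside a function (where "$@" is local to the function).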
Error handling: explicit checks, traps, and meaningful failures
Strict mode helps, but it’s not a complete error-handling strategy. Some commands intentionally return non-zero for non-error conditions (for example, grep returns 1 when no matches are found). Some errors are “expected” and should be handled.
A good approach is:
- Let unexpected failures abort the script.
- Explicitly handle expected non-zero statuses.
- Centralize messaging so operators can understand what failed.
Use traps to add context and cleanup
Traps are a practical way to add context on failure and ensure temporary resources are cleaned up.
bash
tmpdir=
cleanup() {
[[ -n ${tmpdir:-} && -d ${tmpdir:-} ]] && rm -rf -- "$tmpdir"
}
on_err() {
local exit_code=$?
local line_no=${BASH_LINENO[0]:-}
log "Failed at line ${line_no} with exit code ${exit_code}"
exit "$exit_code"
}
trap cleanup EXIT
trap on_err ERR
tmpdir=$(mktemp -d)
This pattern matters when scripts manipulate configuration files, download artifacts, or build intermediate lists. Cleanup reduces the chance that partial state affects later runs.
Be careful with trap ERR interactions: subshells, set -e, and conditional contexts can change when ERR triggers. The intent here is not to rely solely on traps, but to use them as a safety net with better diagnostics.
Handle expected non-zero results without disabling safety globally
A frequent anti-pattern is turning off set -e around blocks, which can hide real errors. Instead, handle specific commands:
bash
if grep -q "^${user}:" /etc/passwd; then
log "User exists: $user"
else
log "User missing: $user"
fi
Or capture the status explicitly:
bash
if ! systemctl is-active --quiet myservice; then
log "myservice is not active; attempting start"
systemctl start myservice
fi
This keeps the overall script strict while acknowledging that not every non-zero is fatal.
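When you need the actual exit code rather than a boolean, capture it without tripping set -e by pairing the command with a fallback assignment; a sketch using grep’s documented statuses (0 match, 1 no match, 2 error):
bash
status=0
grep -q "^${user}:" /etc/passwd || status=$?
case "$status" in
    0) log "User exists: $user" ;;
    1) log "User missing: $user" ;;
    *) die "grep failed with status $status while reading /etc/passwd" ;;
esac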
Logging: make scripts operable under cron and systemd
A script is operationally useful only if you can tell what it did. Logging is also how you avoid re-running risky actions because you can’t confirm state.
For most sysadmin scripts, log to stderr (so stdout can be used for data pipelines) and prefix with timestamps. If the script is used under systemd, journald will capture stderr and add metadata.
A simple logger:
bash
log() {
local level=${1:-INFO}
shift || true
printf '%s %-5s %s\n' "$(date -Is)" "$level" "$*" >&2
}
When you perform side-effect actions (delete files, restart services, apply firewall rules), log the intent before the action and log the result after. This is not about verbosity; it’s about producing an audit trail that allows incident responders to reconstruct events.
Real-world scenario 1: a cron job that “worked” but broke log retention
Consider an estate where an app writes logs to /var/log/myapp/ and a cron job compresses old logs. The initial script ran find /var/log/myapp -name *.log -mtime +7 -exec gzip {} \;. It “worked” until files matching *.log started appearing in the directory cron ran it from.
Because *.log was unquoted, the shell expanded it before find ever ran, in the current working directory (cron often starts in $HOME or /). On hosts where nothing matched, the literal pattern passed through and the command behaved as intended, hiding the bug; on hosts where the cron user’s home contained matching files, -name received those filenames instead of the pattern, causing find usage errors or compressing the wrong things, and leading to unpredictable results across the fleet.
The fix wasn’t complicated; it was discipline:
bash
find "$LOG_DIR" -type f -name '*.log' -mtime +7 -print0 \
| xargs -0 -r gzip --
Along with logging what’s being changed and a --dry-run mode, the script became predictable. The important lesson is that quoting and operability go together—when you log the exact command intent and protect expansions, you avoid “it depends where cron ran it” outcomes.
Idempotency and safe re-runs: design for retries
In operations, scripts get re-run: a maintenance window is interrupted, a host reboots mid-run, or a pipeline retries after a transient failure. A script that assumes a clean slate can make recovery worse.
Idempotency means you can run the script multiple times and reach the same desired state without causing unintended side effects. Shell scripts can be idempotent if they check state before changing it.
For example, rather than blindly appending a sysctl setting:
bash
# Avoid: duplicates on every run
echo 'net.ipv4.ip_forward=1' >> /etc/sysctl.conf
Prefer ensuring the line exists exactly once (using a tool appropriate to your environment). On many Linux systems, sysctl.d drop-ins are safer than editing a monolithic file:
bash
conf=/etc/sysctl.d/99-forwarding.conf
printf '%s\n' 'net.ipv4.ip_forward=1' > "$conf"
sysctl -p "$conf"
Now a re-run overwrites the same file and re-applies the setting. This is simpler to reason about and easier to audit.
When idempotency isn’t possible (for example, a one-time data migration), add explicit guards: create a marker file, record a version stamp, or require --force.
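A sketch of such a guard, using an illustrative marker path, a hypothetical run_migration function, and assuming --force sets FORCE=1:
bash
marker=/var/lib/myapp/.migration-2024-10.done   # illustrative path
if [[ -e $marker && ${FORCE:-0} -ne 1 ]]; then
    log INFO "Migration already applied; re-run with --force to override"
    exit 0
fi
run_migration
touch -- "$marker"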
Input validation: fail early and explain what’s wrong
Scripts often take input from operators, CI variables, inventory files, or remote APIs. Validation prevents your script from becoming a command execution engine.
Validation is not only about security; it’s about correctness. For example:
- Ensure directories exist and are directories.
- Ensure numeric inputs are numeric and within bounds.
- Ensure hostnames match expected patterns.
- Ensure required binaries are present.
In Bash, validate numbers carefully:
bash
is_uint() { [[ $1 =~ ^[0-9]+$ ]]; }
if ! is_uint "$KEEP" || [[ $KEEP -lt 1 || $KEEP -gt 365 ]]; then
die "--keep must be an integer between 1 and 365"
fi
Validate paths to avoid dangerous values:
bash
[[ -d "$LOG_DIR" ]] || die "Log directory not found: $LOG_DIR"
[[ "$LOG_DIR" != "/" ]] || die "Refusing to operate on /"
That last check looks obvious until you see a script where LOG_DIR is derived from a variable that ends up empty because of a poorly handled default (for example, ${LOG_DIR:-} satisfies set -u but still expands to an empty string). Guardrails should be cheap and explicit.
Dependencies and preflight checks: make failure modes obvious
When a script fails due to a missing tool, wrong permissions, or unsupported platform, you want it to fail immediately with a clear message.
A preflight function should check:
- Required commands exist (command -v)
- Required privileges (root or capabilities)
- OS/distro constraints if relevant
- Network reachability if the script depends on it
- Disk space when writing large files
Example:
bash
need_cmd() { command -v "$1" >/dev/null 2>&1 || die "Missing required command: $1"; }
preflight() {
need_cmd find
need_cmd gzip
need_cmd xargs
[[ -w "$LOG_DIR" ]] || die "No write permission on $LOG_DIR"
}
This is also where you should confirm whether you’re running under cron/systemd with a minimal environment and adjust as needed (PATH already pinned, locale set).
Filesystem safety: temporary files, atomic writes, and locking
Many scripts edit configuration files, generate reports, or rotate logs. Filesystem operations are where partial failures can leave corrupted state.
Use mktemp and clean up
Never use predictable temp filenames like /tmp/file.tmp. Use mktemp to avoid collisions and symlink attacks.
bash
tmpfile=$(mktemp)
trap 'rm -f -- "$tmpfile"' EXIT
Write atomically when updating files
If you generate a new version of a file, write to a temporary file in the same filesystem and then mv it into place. On the same filesystem, mv is atomic.
bash
out=/etc/myapp/allowlist.conf
new=$(mktemp "${out}.XXXX")
generate_allowlist > "$new"
chmod 0640 "$new"
chown root:myapp "$new"
mv -f -- "$new" "$out"
This prevents readers from seeing half-written files and makes rollback easier.
Use locks for scripts that can overlap
Cron, systemd timers, and manual runs can overlap. Overlap can corrupt state (double rotation, concurrent downloads, parallel edits).
On most Linux systems, flock is the simplest locking mechanism:
bash
lock=/var/lock/rotate-myapp-logs.lock
exec 9>"$lock"
flock -n 9 || die "Another instance is running"
This ensures only one instance runs at a time. If you need to support environments without flock, you can use lock directories (mkdir is atomic) but you then need robust stale-lock handling.
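A minimal sketch of the mkdir variant; note that if the process is killed without running its trap, the directory remains and must be cleaned up manually or by a staleness check:
bash
lockdir=/var/lock/rotate-myapp-logs.d
if ! mkdir -- "$lockdir" 2>/dev/null; then
    die "Another instance appears to be running (lock: $lockdir)"
fi
trap 'rmdir -- "$lockdir"' EXIT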
Parsing command output: avoid it when possible, and be defensive when you must
Shell scripts often glue tools together, which tempts you to parse “human output” from commands. That output can change across versions, locales, and flags.
Prefer machine-friendly interfaces when available:
- Use --json output options where tools provide them.
- Use stable formats like key=value.
- Use stat with custom formatting.
When you must parse, constrain and validate. For example, when enumerating files, do not parse ls. Use find with null delimiters:
bash
find "$LOG_DIR" -type f -name '*.log' -print0
Similarly, when processing lines, use read -r to avoid backslash escapes and preserve content:
bash
while IFS= read -r line; do
# process $line
:
done < "$file"
Real-world scenario 2: parsing df output caused false disk alarms
A team wrote a health-check script that ran df -h and extracted the “Use%” column with awk '{print $5}'. On a subset of systems, df output included filesystem sources with long or space-containing names (device-mapper volumes, bind mounts, certain container setups), which wrapped lines or shifted columns and produced invalid percentages. The script then paged on “disk usage 100%” because it parsed the wrong field.
A more robust approach was to query df in a predictable format and select the filesystem of interest explicitly:
bash
df -P -- "$mount" | awk 'NR==2 {gsub(/%/,"",$5); print $5}'
The -P POSIX format stabilizes columns, and restricting to the mount avoids surprises. Even better in some environments was using stat -f or reading from /proc/mounts and calculating usage per filesystem, but the key best practice is the same: if you parse, make the output format deterministic and validate what you extracted.
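A sketch that combines the stable format with a sanity check on the extracted value before acting on it:
bash
usage=$(df -P -- "$mount" | awk 'NR==2 {gsub(/%/, "", $5); print $5}')
[[ $usage =~ ^[0-9]+$ && $usage -le 100 ]] || die "Could not parse disk usage for $mount (got: '$usage')"
if (( usage >= 90 )); then
    log WARN "$mount is at ${usage}% capacity"
fi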
Use the right tool for text processing (and keep it readable)
Shell scripts are strongest when orchestrating other tools. Don’t force everything into shell parameter expansion if it reduces clarity. Use:
- awk for structured line-based processing.
- sed for simple substitutions.
- grep with anchors for matching.
- cut when fields are truly delimiter-separated.
At the same time, avoid pipelines that are hard to debug. If a pipeline becomes too dense, break it into named steps and log intermediate results when running in verbose mode.
A maintainable pattern is to keep transformations in functions that can be unit-tested with sample input.
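For example, isolating a transformation in a function lets you exercise it with sample input before trusting it on real data (the passwd-style filter here is illustrative):
bash
# Print "name uid" for human users (UID >= 1000) from passwd-format input.
extract_users() {
    awk -F: '$3 >= 1000 {print $1, $3}'
}

# Quick check against a known sample:
extract_users <<'EOF'
root:x:0:0:root:/root:/bin/bash
alice:x:1001:1001::/home/alice:/bin/bash
EOF
# Expected output: alice 1001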
Exit codes and error messages: be useful to automation and humans
Scripts often run under:
- cron/systemd timers
- CI/CD pipelines
- orchestrators (Ansible calling scripts, Kubernetes jobs)
Those systems usually care only about exit code and logs. Make both meaningful.
Use exit 2 for usage errors (bad flags, missing arguments), and exit 1 (or higher) for runtime failures. If you integrate with monitoring, stable exit codes allow you to classify alerts.
Error messages should state:
- what failed
- what the script was trying to do
- what the operator can check (path, permissions, missing dependency)
Avoid messages like “failed” with no context. Prefer die "gzip failed for $file".
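One lightweight way to keep codes stable is a die variant that takes an explicit code, so usage errors and runtime failures stay distinguishable to callers (the specific codes are a local convention):
bash
# die CODE MESSAGE...
die() {
    local code=$1; shift
    log ERROR "$*"
    exit "$code"
}

[[ -n ${LOG_DIR} ]] || die 2 "--dir is required"        # usage error
gzip -- "$file"     || die 1 "gzip failed for $file"    # runtime failure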
Security: least privilege, safe handling of secrets, and controlled execution
Security best practices in shell scripting are often framed as “don’t do X,” but operationally you need actionable patterns.
Run with the minimum required privileges
If only one operation requires root, consider splitting the script into a privileged helper and an unprivileged controller, or use sudo for the specific command. At minimum, check effective UID when root is required:
bash
[[ ${EUID:-$(id -u)} -eq 0 ]] || die "This script must be run as root"
Avoid running everything as root “because it’s easier.” Many mistakes become catastrophes under root.
Avoid eval and untrusted code execution
eval turns data into code. If any part of the evaluated string comes from user input, environment variables, or files, it becomes an injection vector.
If you feel you “need” eval to build a command dynamically, step back and use arrays (Bash) or a case statement. In POSIX sh, consider redesigning the interface rather than evaluating strings.
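A sketch of the array approach for a dynamically assembled command; PROXY_URL, INSECURE, and url stand in for whatever inputs your script has already validated:
bash
cmd=(curl --fail --silent --show-error)
[[ -n ${PROXY_URL:-} ]] && cmd+=(--proxy "$PROXY_URL")
[[ ${INSECURE:-0} -eq 1 ]] && cmd+=(--insecure)
cmd+=("$url")
"${cmd[@]}"

Every element stays a single argument no matter what it contains, so there is nothing for an attacker-controlled string to inject into.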
Handle secrets carefully
Common pitfalls:
- passing tokens on the command line (visible in ps)
- writing secrets into logs
- storing secrets in world-readable temp files
Prefer reading secrets from protected files or environment variables injected by a secret manager, and avoid printing them. If you must pass a secret to a command, use stdin or a file descriptor where supported.
For example, for tools that can read a password from stdin, avoid --password flags. If a tool only accepts a flag, consider whether shell is the right integration mechanism.
Also be aware that set -x (xtrace) will print commands and their expanded arguments. If you use xtrace for debugging, enable it only conditionally and never in production runs that may handle secrets.
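A common pattern is to gate xtrace behind an environment variable so it is never on by default (TRACE is just a local convention):
bash
# Enable with: TRACE=1 ./script.sh
if [[ ${TRACE:-0} -eq 1 ]]; then
    export PS4='+ ${BASH_SOURCE##*/}:${LINENO}: '   # show file:line in trace output
    set -x
fi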
Portability across distributions and environments
A script that runs on Ubuntu 22.04 may fail on RHEL 8, Alpine, or inside a minimal container. Portability is not “write once run anywhere,” but you can make informed choices.
Key areas:
- Tool differences (sed -i behavior, grep -P availability, tar flags)
- System service management (systemd vs others; in modern enterprise Linux it’s usually systemd)
- Paths (/usr/bin vs /bin merges; differences in ip vs ifconfig availability)
When portability matters, pin assumptions:
- Use #!/usr/bin/env bash only if Bash is guaranteed.
- Use POSIX options (df -P rather than relying on default output).
- Prefer printf over echo for predictable escape behavior.
If you operate mixed fleets, include platform detection sparingly and clearly:
bash
os_id=
if [[ -r /etc/os-release ]]; then
# shellcheck disable=SC1091
. /etc/os-release
os_id=${ID:-}
fi
Then branch only when necessary (for example, package manager differences). Too many distro branches can make a script unmaintainable; at that point, configuration management tools may be a better fit.
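For example, a single branch point for package installation keeps distro-specific logic in one place (the ID values shown are common ones; verify against your fleet):
bash
pkg_install() {
    case "$os_id" in
        rhel|rocky|almalinux|centos) dnf -y install "$@" ;;
        debian|ubuntu)               apt-get -y install "$@" ;;
        *) die "Unsupported distribution: ${os_id:-unknown}" ;;
    esac
}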
Concurrency and remote execution: SSH fan-out without chaos
Many sysadmin scripts run commands across multiple hosts. The shell makes it easy to write loops over ssh, but it’s also easy to create fragile automation.
Principles:
- Make remote commands explicit and quote them carefully.
- Limit concurrency to avoid saturating networks or central services.
- Capture per-host results and exit codes.
A simple sequential approach is often sufficient and easier to troubleshoot:
bash
while IFS= read -r host; do
log INFO "Checking $host"
# -n keeps ssh from consuming the loop's stdin (the rest of hosts.txt)
if ssh -n -o BatchMode=yes -o ConnectTimeout=5 -- "$host" 'systemctl is-active --quiet myservice'; then
log INFO "$host: active"
else
log WARN "$host: not active"
fi
done < hosts.txt
If you add parallelism, do it intentionally (for example, with xargs -P or GNU parallel where available) and ensure your logging includes host context.
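A sketch of bounded parallelism with GNU xargs, exporting a per-host function so every output line stays attributable (hosts.txt and results.tsv are illustrative):
bash
check_host() {
    local host=$1
    if ssh -n -o BatchMode=yes -o ConnectTimeout=5 -- "$host" 'systemctl is-active --quiet myservice'; then
        printf '%s\tactive\n' "$host"
    else
        printf '%s\tinactive\n' "$host"
    fi
}
export -f check_host
# At most 10 concurrent checks; one host per invocation.
xargs -r -n1 -P10 bash -c 'check_host "$1"' _ < hosts.txt > results.tsv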
Real-world scenario 3: a restart script triggered a cascading outage
An engineer wrote a script to restart a service across 400 nodes during a maintenance window. The script ran restarts in parallel using background jobs without limits. Each restart caused the service to reconnect to a shared database and warm caches. The sudden thundering herd overloaded the database, causing failures across the fleet, which the script interpreted as “restart failed” and retried, compounding the load.
The remediation was to apply operational best practices in the script:
- limit concurrency (restart 10 at a time)
- add jitter between batches
- treat certain failures as “stop and investigate” rather than “retry immediately”
- log per-host outcomes to a file for audit
Even without introducing heavy tooling, a small concurrency control with xargs -P and a careful restart function reduced risk dramatically. This is a reminder that scripting best practices aren’t just syntax—they are about designing safe operational behavior.
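A minimal sketch of the batching idea using plain background jobs rather than xargs -P (batch size and jitter range are illustrative, and restart_host is a placeholder for the careful per-host logic):
bash
batch_size=10
count=0
while IFS= read -r host; do
    restart_host "$host" &
    count=$((count + 1))
    if (( count % batch_size == 0 )); then
        wait                               # let the current batch finish
        sleep "$(( RANDOM % 20 + 10 ))"    # 10-29s of jitter between batches
    fi
done < hosts.txt
wait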
Scheduling: cron vs systemd timers and environment consistency
Where a script runs matters as much as what it does.
Cron is ubiquitous and simple, but it provides a minimal environment and coarse logging unless redirected. Systemd timers offer better control over environment, dependencies, and logging to journald.
If you use cron, explicitly set PATH and redirect output:
cron
PATH=/usr/sbin:/usr/bin:/sbin:/bin
MAILTO=""
15 2 * * * /usr/local/sbin/rotate-app-logs.sh --dir /var/log/myapp --keep 14 >>/var/log/rotate-myapp.log 2>&1
If you use systemd timers, you can rely on journald and specify a clean environment in the unit file. Even then, keep your script self-sufficient (PATH, locale) so it’s reliable when run manually or by other automation.
Testing and linting: treat scripts like code, not terminal history
Shell scripts are notoriously easy to “get mostly right” and hard to make reliably correct. Testing and linting are how you catch the edge cases you didn’t think about.
ShellCheck: the highest ROI tool for shell scripts
ShellCheck is a static analyzer that catches common bugs: unquoted variables, unreachable code, bad test syntax, and portability issues. It’s not perfect, but it’s exceptionally effective.
Run it locally and in CI:
bash
shellcheck -x ./rotate-app-logs.sh
The -x allows following sourced files. Don’t blindly silence warnings; understand them and either fix the issue or add a comment explaining why the warning is not applicable.
Simple test hooks: dry-run mode and dependency injection
Full unit testing in shell is possible (with frameworks like Bats), but even without adopting a framework, you can make scripts easier to validate.
A --dry-run mode that prints actions instead of performing them is extremely useful. Implement it as a small wrapper:
bash
run_cmd() {
if [[ $DRY_RUN -eq 1 ]]; then
log INFO "DRY-RUN: $*"
else
"$@"
fi
}
run_cmd rm -- "$file"
This also creates a single choke point to add logging, timing, or retries later.
For scripts that call external binaries, consider allowing overrides for testing:
bash
FIND_BIN=${FIND_BIN:-find}
"$FIND_BIN" "$LOG_DIR" -type f -name '*.log'
In production you leave defaults; in tests you can inject a stub.
Integration tests: validate in containers
For portability across distros, containers provide a lightweight way to validate behavior. Even a minimal “smoke test” that runs your script under Ubuntu and Rocky Linux images can catch assumptions about tools, paths, and flags.
The best practice here is less about writing a complex test suite and more about establishing a habit: run scripts in a clean environment before deploying broadly.
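A sketch of that habit, assuming Docker (or a compatible runtime) is available and the image tags suit your estate:
bash
for image in ubuntu:22.04 rockylinux:9; do
    echo "== Smoke test on $image =="
    docker run --rm -v "$PWD:/work:ro" "$image" \
        bash /work/rotate-app-logs.sh --help >/dev/null \
        || echo "FAILED on $image" >&2
done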
Performance and scalability: avoid unnecessary forks and big loops
Shell isn’t a high-performance language, but many scripts are fast enough when written sensibly. The common performance killer is spawning thousands of subprocesses in a loop.
For example, doing for f in ...; do grep ...; done on thousands of files can be slow. Prefer using tools that operate on multiple inputs in one invocation, or use find -exec ... + to batch:
bash
find "$LOG_DIR" -type f -name '*.log' -mtime +7 -exec gzip -- {} +
Similarly, when you need to process many lines, avoid cat file | while read ... patterns that create subshell issues and add overhead. Read directly from the file.
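The subshell issue is easy to demonstrate with a counter: in the piped form, the variable is updated in a child process and the change is lost:
bash
count=0
cat "$file" | while read -r line; do count=$((count + 1)); done
echo "$count"   # still 0: the loop ran in a subshell

count=0
while IFS= read -r line; do count=$((count + 1)); done < "$file"
echo "$count"   # actual line count, and no extra cat process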
Also consider how your script behaves on large directories: using null-delimited streams with -print0 avoids pathological behavior when encountering unusual filenames.
Configuration: prefer files and explicit defaults over ad-hoc edits
As scripts evolve, hard-coded values become liabilities. At the same time, supporting too many knobs can make a script harder to operate.
A balanced approach:
- Provide sensible defaults in the script.
- Allow overrides via flags.
- Optionally support a config file for stable environments.
If you use a config file, keep it simple (key=value) and validate values after loading. Avoid sourcing arbitrary files unless you control them, because sourcing executes code.
A safer pattern is to parse key=value with limited rules, but that can become its own mini-parser. If you do source a config file, ensure permissions are strict and the file is in a trusted path.
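A sketch of such a loader that accepts only known keys and never executes the file (key names follow the flags used earlier):
bash
load_config() {
    local file=$1 key value
    while IFS='=' read -r key value; do
        case "$key" in
            ''|'#'*) continue ;;                       # skip blank lines and comments
            LOG_DIR) LOG_DIR=$value ;;
            KEEP)    KEEP=$value ;;
            *) die "Unknown config key in $file: $key" ;;
        esac
    done < "$file"
}

Values loaded this way still need the same validation as flag input.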
Documentation inside the script: comments that explain “why”
Comments should not narrate what the code is obviously doing; they should explain intent, constraints, and decisions.
Good examples:
- why you chose a specific flag because of distro differences
- why a retry exists and what failure it mitigates
- what invariants are assumed (must be root, must run on systemd hosts)
Avoid extensive inline commentary that duplicates command names. Instead, prefer a clear function name and a short comment where needed.
A brief header block is often enough:
bash
# Purpose: Rotate and compress MyApp logs safely under cron/systemd.
# Requires: bash, find, gzip, xargs
# Safety: uses flock to avoid overlaps; supports --dry-run.
A cohesive example: a safer log rotation script pattern
To tie together the practices discussed so far—strict mode, validation, locking, safe iteration, and operability—here is a compact but production-oriented pattern. This is not meant to replace logrotate where it fits, but it reflects how to write a reliable script when you need custom behavior.
bash
#!/usr/bin/env bash
set -Eeuo pipefail
IFS=$'\n\t'
export LC_ALL=C
export PATH=/usr/sbin:/usr/bin:/sbin:/bin
umask 027
usage() {
cat <<'EOF'
Usage: rotate-app-logs.sh --dir DIR --keep N [--dry-run]
Rotates *.log in DIR by compressing files older than 7 days and pruning
archives beyond N per base filename.
EOF
}
log() { printf '%s %-5s %s\n' "$(date -Is)" "${1:-INFO}" "${2:-}" >&2; }
die() { log ERROR "$*"; exit 1; }
run_cmd() {
if [[ $DRY_RUN -eq 1 ]]; then
log INFO "DRY-RUN: $*"
else
"$@"
fi
}
need_cmd() { command -v "$1" >/dev/null 2>&1 || die "Missing required command: $1"; }
parse_args() {
DRY_RUN=0
LOG_DIR=
KEEP=
while [[ $# -gt 0 ]]; do
case "$1" in
--dir) LOG_DIR=${2-}; shift 2 ;;
--keep) KEEP=${2-}; shift 2 ;;
--dry-run) DRY_RUN=1; shift ;;
-h|--help) usage; exit 0 ;;
*) die "Unknown argument: $1" ;;
esac
done
[[ -n ${LOG_DIR} ]] || die "--dir is required"
[[ -d ${LOG_DIR} ]] || die "Not a directory: $LOG_DIR"
[[ ${LOG_DIR} != "/" ]] || die "Refusing to operate on /"
[[ ${KEEP} =~ ^[0-9]+$ ]] || die "--keep must be an integer"
[[ $KEEP -ge 1 && $KEEP -le 365 ]] || die "--keep out of range"
}
preflight() {
need_cmd find
need_cmd gzip
need_cmd xargs
need_cmd flock
[[ -w "$LOG_DIR" ]] || die "No write permission: $LOG_DIR"
}
with_lock() {
local lock=/var/lock/rotate-app-logs.lock
exec 9>"$lock"
flock -n 9 || die "Another instance is running"
}
compress_old_logs() {
log INFO "Compressing .log files older than 7 days in $LOG_DIR"
# Null-delimited to handle all valid filenames safely.
if [[ $DRY_RUN -eq 1 ]]; then
find "$LOG_DIR" -type f -name '*.log' -mtime +7 -print
else
find "$LOG_DIR" -type f -name '*.log' -mtime +7 -print0 | xargs -0 -r gzip --
fi
}
main() {
parse_args "$@"
preflight
with_lock
compress_old_logs
log INFO "Done"
}
main "$@"
This example stops short of implementing per-base-name retention because that logic can quickly become bespoke (and many environments are better served by logrotate). What matters is the pattern: safe expansion, clear arguments, predictable behavior under cron/systemd, and lock-based overlap protection.
As you adapt it for your own needs, keep the earlier sections in mind: if you add pruning, ensure it’s idempotent; if you parse filenames, use null delimiters; if you introduce remote operations, log per-host results and limit concurrency.
Maintainability over time: versioning, change control, and safe deployment
Shell scripts that matter should be treated like other infrastructure code.
Put scripts in version control. Require review for changes that affect production. Tag releases when scripts are deployed broadly. Include a --version flag when it helps operators confirm what is running.
When deploying changes, aim for safe rollout:
- test in a non-production environment (or subset of hosts)
- run in --dry-run mode first where possible
- stage changes with feature flags (--enable-prune)
Also be cautious about modifying scripts in place on live systems without tracking: it makes incident response harder because you cannot easily reconstruct what logic ran.
When shell is the wrong tool
Shell scripting is excellent for orchestration, but some problems become brittle in shell:
- complex data structures (nested JSON, state machines)
- heavy concurrency and retries with backoff
- robust API integrations with authentication flows
- large-scale text parsing where correctness is critical
Recognizing this is itself a best practice. If a script is becoming a small program with significant logic, moving to Python/Go (or using configuration management) can reduce risk. A practical heuristic is: if you’re implementing your own parser, scheduler, or mini-database in shell, it’s time to reassess.
That said, the practices in this guide still apply: clear interfaces, safety defaults, explicit dependencies, and operability are language-agnostic. Applying them in shell is how you keep your automation dependable even when it starts as “just a script.”