General Guidelines
This document collects practical operational and engineering guidelines for Elasticsearch beyond basic sizing: security, backups and recovery, index and mapping best practices, performance tuning, cluster operations, monitoring, and common troubleshooting steps.
Security
- Use TLS for all node-to-node and client-to-node communication. Protect transport and HTTP layers.
- Enable authentication and role-based access control (RBAC). Avoid using the default built-in superuser for application access.
- Apply the principle of least privilege: create roles that only allow the actions necessary for a user/service (see the sketch after this list).
- Rotate certificates and credentials regularly and automate rotation where possible.
- Audit access and actions using Elasticsearch auditing features or an external audit pipeline.
- Keep Elasticsearch and plugin versions up to date to receive security fixes.
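As a minimal sketch of the least-privilege point above, the following creates a read-only role via the security API. The endpoint URL, credentials, CA path, role name, and index pattern are placeholders; adjust them to your deployment.

```python
import requests

ES = "https://localhost:9200"                    # placeholder cluster endpoint
ADMIN_AUTH = ("admin_user", "admin_password")    # placeholder admin credentials
CA_CERT = "ca.crt"                               # CA bundle for TLS verification (placeholder path)

# A role that can only read (and view metadata for) the application's indices.
role = {
    "cluster": [],                               # no cluster-level privileges
    "indices": [
        {
            "names": ["app-logs-*"],             # hypothetical index pattern
            "privileges": ["read", "view_index_metadata"],
        }
    ],
}

resp = requests.put(
    f"{ES}/_security/role/app_logs_reader",
    json=role,
    auth=ADMIN_AUTH,
    verify=CA_CERT,
)
resp.raise_for_status()
```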
Backups, snapshots, and disaster recovery
- Use snapshots to an external repository (S3, GCS, NFS) for backups. Snapshots are incremental and are best used alongside a tested restore plan.
- Keep periodic snapshot schedules and test restores regularly — a snapshot is only useful if it can be restored.
- Consider Snapshot Lifecycle Management (SLM) to automate snapshots and retention; an example follows this list.
- For multi-cluster DR, consider cross-cluster replication (CCR) for near-real-time replication of critical indices.
- Ensure snapshot repositories have sufficient throughput and access controls.
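Below is a sketch of registering an S3 snapshot repository and an SLM policy that snapshots nightly and applies retention. The bucket, repository name, policy id, schedule, and retention values are illustrative, not recommendations.

```python
import requests

ES = "https://localhost:9200"                  # placeholder endpoint
AUTH = ("elastic", "changeme")                 # placeholder credentials
CA_CERT = "ca.crt"                             # placeholder CA bundle

# 1. Register an S3 repository (requires S3 repository support on every node).
repo = {"type": "s3", "settings": {"bucket": "my-es-snapshots"}}   # hypothetical bucket
requests.put(f"{ES}/_snapshot/s3_backups", json=repo,
             auth=AUTH, verify=CA_CERT).raise_for_status()

# 2. An SLM policy: snapshot all indices nightly, keep 5-50 snapshots for up to 30 days.
policy = {
    "schedule": "0 30 1 * * ?",                # cron: every day at 01:30
    "name": "<nightly-{now/d}>",
    "repository": "s3_backups",
    "config": {"indices": ["*"], "include_global_state": False},
    "retention": {"expire_after": "30d", "min_count": 5, "max_count": 50},
}
requests.put(f"{ES}/_slm/policy/nightly-snapshots", json=policy,
             auth=AUTH, verify=CA_CERT).raise_for_status()
```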
Index design and mappings
- Design mappings explicitly. Avoid relying on dynamic mappings in production: uncontrolled dynamic field creation can cause mapping explosion and mapping conflicts.
- Use index templates (and composable templates) to ensure consistent mappings, settings, and aliases for time-based indices; a sketch follows this list.
- Choose appropriate field types and analyzers. Avoid using `text` with default analyzers for fields that will be used for exact-match filters; use `keyword` or a `keyword` subfield instead.
- Limit the number of fields: many fields increase memory usage (fielddata, doc values) and mapping complexity.
- Use `doc_values` for fields used in aggregations and sorting (enabled by default for most fields). Disable `index` or `doc_values` where not needed.
- Avoid large nested structures when possible; consider denormalization or parent/child for specific use cases.
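To make the mapping guidance concrete, here is a sketch of a composable index template with explicit (`strict`) mappings: a `text` field with a `keyword` subfield for exact matches, and `index`/`doc_values` disabled where they are not needed. The index pattern, field names, and shard counts are assumptions for illustration.

```python
import requests

ES = "https://localhost:9200"                    # placeholder endpoint
AUTH = ("elastic", "changeme")                   # placeholder credentials
CA_CERT = "ca.crt"

template = {
    "index_patterns": ["app-logs-*"],            # hypothetical time-based indices
    "template": {
        "settings": {"number_of_shards": 1, "number_of_replicas": 1},
        "mappings": {
            "dynamic": "strict",                 # reject unmapped fields instead of guessing
            "properties": {
                "@timestamp": {"type": "date"},
                "message": {"type": "text"},     # full-text search only
                "service": {"type": "keyword"},  # exact-match filters and aggregations
                "host": {
                    "type": "text",
                    "fields": {"raw": {"type": "keyword"}},  # keyword subfield for filters/sorting
                },
                "payload": {
                    "type": "keyword",
                    "index": False,              # not searchable ...
                    "doc_values": False,         # ... and not aggregatable/sortable
                },
            },
        },
    },
}

requests.put(f"{ES}/_index_template/app-logs", json=template,
             auth=AUTH, verify=CA_CERT).raise_for_status()
```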
Performance tuning and indexing best practices
- Bulk API: use the bulk API for indexing to minimize overhead. Tune bulk sizes to balance throughput and memory (typical payloads vary; test 5–50MB as a starting point).
- Refresh interval: increase `refresh_interval` during heavy bulk indexing (set it to `-1` to disable refreshes, then restore it afterwards) to reduce refresh overhead; see the sketch after this list.
- Translog durability: for very high ingest rates, consider the `index.translog.durability` and `index.translog.sync_interval` settings carefully; be aware of the durability trade-offs.
- Merge and segment tuning: monitor merges and adjust the `index.merge.policy` settings only when you have measured merge-related performance issues; merges are complex and risky to tune without measurement.
- Request and query cache: use the request cache for repetitive, cacheable queries; be mindful of cache sizes.
- Thread pools and queue sizes: monitor thread pools (search, write, get) and tune queues only if you encounter rejections, after analyzing causes.
- Circuit breakers: do not disable circuit breakers. They protect the cluster from out-of-memory conditions.
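A sketch of a bulk-indexing loop that disables `refresh_interval` for the duration of the load and restores it afterwards. The index name, document shape, and batch size are placeholders; measure before settling on a bulk size.

```python
import json
import requests

ES = "https://localhost:9200"            # placeholder endpoint
AUTH = ("elastic", "changeme")           # placeholder credentials
CA_CERT = "ca.crt"
INDEX = "app-logs-2024.01.01"            # hypothetical index name

def set_refresh(interval):
    """Set index.refresh_interval ('-1' disables refreshes)."""
    requests.put(f"{ES}/{INDEX}/_settings",
                 json={"index": {"refresh_interval": interval}},
                 auth=AUTH, verify=CA_CERT).raise_for_status()

def bulk_index(docs):
    """Send one _bulk request: an 'index' action line followed by each document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc))
    body = "\n".join(lines) + "\n"       # bulk bodies must end with a newline
    resp = requests.post(f"{ES}/{INDEX}/_bulk", data=body,
                         headers={"Content-Type": "application/x-ndjson"},
                         auth=AUTH, verify=CA_CERT)
    resp.raise_for_status()
    if resp.json().get("errors"):
        raise RuntimeError("some bulk items failed; inspect the per-item responses")

set_refresh("-1")                        # pause refreshes during the heavy load
try:
    docs = [{"message": f"event {i}"} for i in range(10_000)]
    for start in range(0, len(docs), 1_000):   # illustrative batch size; tune by measurement
        bulk_index(docs[start:start + 1_000])
finally:
    set_refresh("1s")                    # restore refreshes (default is 1s)
```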
Cluster operations and lifecycle
- Rolling upgrades: follow the official rolling upgrade procedures to upgrade nodes with minimal downtime.
- Bootstrap checks: when running in production mode, ensure bootstrap checks pass (heap, file descriptors, mmap settings, etc.).
- Shard allocation awareness: use allocation awareness and forced awareness attributes for racks/availability zones to improve resiliency.
- Disk-based shard allocation: configure the `cluster.routing.allocation.disk.*` settings (disk watermarks) so nodes do not fill up completely; an example follows this list.
- Node roles: isolate roles where appropriate (master-only, data-hot, data-warm, ingest, coordinating) to reduce resource contention.
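Sketched below, per the disk-allocation and awareness bullets above, is a persistent cluster-settings update that sets the disk watermarks and an awareness attribute. The thresholds and the `zone` attribute are examples, not recommendations; nodes must be started with a matching `node.attr.zone` value.

```python
import requests

ES = "https://localhost:9200"            # placeholder endpoint
AUTH = ("elastic", "changeme")           # placeholder credentials
CA_CERT = "ca.crt"

settings = {
    "persistent": {
        # Stop allocating new shards to a node at 80% disk, move shards off at 85%,
        # and enforce the flood-stage protection at 95%. Values here are illustrative.
        "cluster.routing.allocation.disk.watermark.low": "80%",
        "cluster.routing.allocation.disk.watermark.high": "85%",
        "cluster.routing.allocation.disk.watermark.flood_stage": "95%",
        # Spread shard copies across the node attribute 'zone'
        # (each node must set node.attr.zone in elasticsearch.yml).
        "cluster.routing.allocation.awareness.attributes": "zone",
    }
}

requests.put(f"{ES}/_cluster/settings", json=settings,
             auth=AUTH, verify=CA_CERT).raise_for_status()
```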
Monitoring, logging and alerting
- Collect metrics (JVM, OS, disk, network, index/search metrics) and send them to a monitoring cluster or an APM/metrics system.
- Enable slow logs (the `index.search.slowlog` and `index.indexing.slowlog` settings) to capture slow queries and slow indexing operations; see the example after this list.
- Monitor GC pauses and heap usage; long GC pauses can cause node instability.
- Alert on cluster health changes (red/yellow), node restarts, high disk usage, and high queue rejection rates.
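The example below applies slow-log thresholds to an index and then reads per-node rejection counts from the `_cat/thread_pool` API; the index name and thresholds are placeholders.

```python
import requests

ES = "https://localhost:9200"            # placeholder endpoint
AUTH = ("elastic", "changeme")           # placeholder credentials
CA_CERT = "ca.crt"
INDEX = "app-logs-*"                     # hypothetical index pattern

# Dynamic slow-log thresholds (values are illustrative; tune per workload).
slowlog = {
    "index.search.slowlog.threshold.query.warn": "5s",
    "index.search.slowlog.threshold.query.info": "1s",
    "index.indexing.slowlog.threshold.index.warn": "5s",
}
requests.put(f"{ES}/{INDEX}/_settings", json=slowlog,
             auth=AUTH, verify=CA_CERT).raise_for_status()

# Check for search/write queue rejections, a common signal of overload.
resp = requests.get(
    f"{ES}/_cat/thread_pool/search,write",
    params={"format": "json", "h": "node_name,name,active,queue,rejected"},
    auth=AUTH, verify=CA_CERT,
)
resp.raise_for_status()
for row in resp.json():
    if int(row["rejected"]) > 0:
        print(f"{row['node_name']} {row['name']}: {row['rejected']} rejections")
```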
Maintenance and troubleshooting
- Test restores from snapshots periodically and document the restore steps and timelines; a restore sketch follows this list.
- When shards relocate for long periods, inspect network, disk I/O, and indexing pressure; consider throttling recovery and rebalancing if needed.
- Keep an eye on file descriptors and ensure ulimits meet bootstrap check recommendations.
- For memory pressure issues, inspect large fielddata usage, circuit breaker events, and excessive aggregations.
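As a sketch of the restore-testing and memory-pressure points above: restore a snapshot into renamed indices (so production indices are untouched) and read circuit-breaker statistics from the nodes-stats API. The repository, snapshot, and index names are placeholders.

```python
import requests

ES = "https://localhost:9200"            # placeholder endpoint
AUTH = ("elastic", "changeme")           # placeholder credentials
CA_CERT = "ca.crt"

# Restore one index from a snapshot under a 'restored-' prefix for verification.
restore = {
    "indices": "app-logs-2024.01.01",    # hypothetical index in the snapshot
    "rename_pattern": "(.+)",
    "rename_replacement": "restored-$1",
    "include_global_state": False,
}
requests.post(f"{ES}/_snapshot/s3_backups/nightly-2024.01.02/_restore",
              json=restore, auth=AUTH, verify=CA_CERT).raise_for_status()

# Inspect circuit-breaker trip counts as part of a memory-pressure investigation.
resp = requests.get(f"{ES}/_nodes/stats/breaker", auth=AUTH, verify=CA_CERT)
resp.raise_for_status()
for node_id, node in resp.json()["nodes"].items():
    for name, breaker in node["breakers"].items():
        if breaker["tripped"] > 0:
            print(f"{node['name']}: breaker '{name}' tripped {breaker['tripped']} times")
```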
Kubernetes and containerized deployments
- When running on Kubernetes, use the official Elastic operator (ECK) or Helm charts, which handle lifecycle and settings for you.
- Ensure persistent volumes are backed by fast storage and have proper access modes.
- Configure pod anti-affinity so that Elasticsearch pods do not co-locate on the same Kubernetes node unless intended.
Suggested defaults and examples
- Default JVM heap: `min(physical RAM / 2, ~31GB)`
- Avoid disk usage above 70–80% per node
- Start with replicas=1 for HA and increase read throughput by adding replica copies where needed
- For time-series indices, use ILM to automate rollover and movement across tiers
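For the time-series point above, here is a sketch of an ILM policy that rolls over in the hot phase, moves shards to warm nodes, and deletes old indices. The sizes and ages are illustrative, and the warm-phase allocation assumes nodes tagged with a matching attribute.

```python
import requests

ES = "https://localhost:9200"            # placeholder endpoint
AUTH = ("elastic", "changeme")           # placeholder credentials
CA_CERT = "ca.crt"

policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Roll over to a new index at 50GB or 7 days, whichever comes first.
                    "rollover": {"max_size": "50gb", "max_age": "7d"}
                }
            },
            "warm": {
                "min_age": "7d",
                "actions": {
                    "forcemerge": {"max_num_segments": 1},
                    # Assumes warm nodes set node.attr.data: warm in elasticsearch.yml.
                    "allocate": {"require": {"data": "warm"}},
                },
            },
            "delete": {
                "min_age": "30d",
                "actions": {"delete": {}},
            },
        }
    }
}

requests.put(f"{ES}/_ilm/policy/app-logs-policy", json=policy,
             auth=AUTH, verify=CA_CERT).raise_for_status()
```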