
General Guidelines

This document collects practical operational and engineering guidelines for Elasticsearch beyond basic sizing: security, backups and recovery, index and mapping best practices, performance tuning, cluster operations, monitoring, and common troubleshooting steps.

Security

  • Use TLS for all node-to-node and client-to-node communication. Protect transport and HTTP layers.
  • Enable authentication and role-based access control (RBAC). Avoid using the default built-in superuser for application access.
  • Apply the principle of least privilege: create roles that only allow the actions necessary for a user or service (see the sketch after this list).
  • Rotate certificates and credentials regularly and automate rotation where possible.
  • Audit access and actions using Elasticsearch auditing features or an external audit pipeline.
  • Keep Elasticsearch and plugin versions up to date to receive security fixes.
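
Least privilege in practice usually means a dedicated role plus an API key per service rather than shared superuser credentials. The following is a minimal sketch using the elasticsearch-py 8.x client; the endpoint, credentials, certificate path, and the role and index names are placeholder assumptions, not values from this document.

```python
# Sketch: create a least-privilege role and an API key for a log-writing
# service. Assumes elasticsearch-py 8.x and a cluster with security enabled;
# all names and credentials below are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://es.example.internal:9200",                # placeholder endpoint
    basic_auth=("admin_user", "admin_password"),       # placeholder admin credentials
    ca_certs="/etc/elasticsearch/certs/http_ca.crt",   # trust the cluster's HTTP CA
)

# A role that can only write into logs-* indices.
es.security.put_role(
    name="logs_writer",
    cluster=["monitor"],  # minimal cluster-level privilege
    indices=[{
        "names": ["logs-*"],
        "privileges": ["create_doc", "create_index", "auto_configure"],
    }],
)

# An API key scoped to that role, handed to the application instead of a
# built-in superuser account.
resp = es.security.create_api_key(
    name="logs-ingest-service",
    role_descriptors={
        "logs_writer_only": {
            "cluster": ["monitor"],
            "indices": [{"names": ["logs-*"], "privileges": ["create_doc"]}],
        }
    },
)
print(resp["encoded"])  # store in a secrets manager, not in source control
```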

Backups, snapshots, and disaster recovery

  • Use snapshots to an external repository (S3, GCS, NFS) for backups. Snapshots are incremental and are best used alongside a tested restore plan.
  • Keep periodic snapshot schedules and test restores regularly; a snapshot is only useful if it can be restored (see the restore sketch after this list).
  • Consider Snapshot Lifecycle Management (SLM) to automate snapshots and retention.
  • For multi-cluster DR, consider cross-cluster replication (CCR) for near-real-time replication of critical indices.
  • Ensure snapshot repositories have sufficient throughput and access controls.
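
Because a snapshot is only proven by a restore, it helps to script the drill. Below is a minimal sketch using the elasticsearch-py 8.x client; it assumes a snapshot repository named "s3_backups" is already registered, and the endpoint, API key, and index pattern are placeholders.

```python
# Sketch: take a snapshot of critical indices, then restore it under a new
# name so the drill never touches live indices. Assumes elasticsearch-py 8.x
# and an already-registered repository called "s3_backups" (placeholder).
from datetime import date
from elasticsearch import Elasticsearch

es = Elasticsearch("https://es.example.internal:9200", api_key="...")  # placeholder auth

snap_name = f"logs-{date.today().isoformat()}"

# Snapshot the logs-* indices and block until the snapshot completes.
es.snapshot.create(
    repository="s3_backups",
    snapshot=snap_name,
    indices="logs-*",
    include_global_state=False,
    wait_for_completion=True,
)

# Restore into renamed indices (restored-logs-*) to verify the snapshot
# without overwriting the originals.
es.snapshot.restore(
    repository="s3_backups",
    snapshot=snap_name,
    indices="logs-*",
    rename_pattern="(.+)",
    rename_replacement="restored-$1",
    wait_for_completion=True,
)
```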

Index design and mappings

  • Design mappings explicitly. Avoid relying on dynamic mappings in production — dynamic types can cause index explosion and mapping conflicts.
  • Use index templates (and composable templates) to ensure consistent mappings, settings, and aliases for time-based indices; see the sketch after this list.
  • Choose appropriate field types and analyzers. Avoid using text with default analyzers for fields that will be used for exact-match filters — use keyword or keyword subfields.
  • Limit the number of fields: many fields increase memory usage (fielddata, doc values) and mapping complexity.
  • Use doc_values for fields used in aggregations and sorting (enabled by default for most fields). Disable index or doc_values where not needed.
  • Avoid large nested structures when possible; consider denormalization or parent/child for specific use cases.
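
As a concrete example of explicit mappings plus a composable template, the sketch below uses the elasticsearch-py 8.x client; the template name, index pattern, and field names are illustrative assumptions.

```python
# Sketch: a composable index template with strict, explicit mappings and a
# keyword subfield for exact-match filtering. Assumes elasticsearch-py 8.x;
# names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://es.example.internal:9200", api_key="...")  # placeholder auth

es.indices.put_index_template(
    name="logs-template",
    index_patterns=["logs-*"],
    priority=100,
    template={
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 1,
        },
        "mappings": {
            "dynamic": "strict",  # reject unmapped fields instead of guessing types
            "properties": {
                "@timestamp": {"type": "date"},
                "service": {"type": "keyword"},  # exact-match filters and aggregations
                "message": {
                    "type": "text",              # analyzed for full-text search
                    "fields": {
                        "raw": {"type": "keyword", "ignore_above": 1024}  # exact-match subfield
                    },
                },
                "duration_ms": {"type": "long"},
            },
        },
    },
)
```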

Performance tuning and indexing best practices

  • Bulk API: use the bulk API for indexing to minimize per-request overhead. Tune bulk sizes to balance throughput and memory (typical payloads vary; test 5–50MB as a starting point); see the sketch after this list.
  • Refresh interval: increase refresh_interval during heavy bulk indexing (set it to -1 to disable refreshes, then restore it once the load finishes) to reduce refresh overhead.
  • Translog durability: for very high ingest rates, weigh index.translog.durability and index.translog.sync_interval carefully; asynchronous durability can lose the most recently acknowledged operations if a node fails.
  • Merge and segment tuning: monitor merges and adjust index.merge.policy only when you have merge-related performance issues — merges are complex and risky to tune without measurement.
  • Request and query cache: use request cache for repetitive, cacheable queries; be mindful of cache sizes.
  • Thread pools and queue sizes: monitor thread pools (search, write, get) and tune queues only if you encounter rejections, after analyzing causes.
  • Circuit breakers: do not disable circuit breakers. They protect the cluster from out-of-memory conditions.
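
The sketch below combines the bulk and refresh_interval advice above, using the elasticsearch-py 8.x bulk helper; the index name, document generator, chunk size, and byte limit are placeholders to tune against your own workload.

```python
# Sketch: bulk-load documents with refresh disabled, then restore the refresh
# interval. Assumes elasticsearch-py 8.x; the numbers below are starting
# points, not recommendations.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("https://es.example.internal:9200", api_key="...")  # placeholder auth
INDEX = "logs-bulk-load"  # placeholder index

if not es.indices.exists(index=INDEX):
    es.indices.create(index=INDEX)

def generate_actions():
    # Placeholder document source; in practice this would read your data.
    for i in range(100_000):
        yield {"_index": INDEX, "_source": {"message": f"event {i}", "seq": i}}

# Disable refresh for the duration of the load.
es.indices.put_settings(index=INDEX, settings={"index": {"refresh_interval": "-1"}})
try:
    # helpers.bulk batches the documents; tune chunk_size / max_chunk_bytes by
    # measuring, rather than copying these values.
    helpers.bulk(es, generate_actions(), chunk_size=2_000, max_chunk_bytes=20 * 1024 * 1024)
finally:
    # Restore a normal refresh interval even if the load fails midway.
    es.indices.put_settings(index=INDEX, settings={"index": {"refresh_interval": "1s"}})
    es.indices.refresh(index=INDEX)
```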

Cluster operations and lifecycle

  • Rolling upgrades: follow the official rolling upgrade procedures to upgrade nodes with minimal downtime.
  • Bootstrap checks: when running in production mode, ensure bootstrap checks pass (heap, file descriptors, mmap settings, etc.).
  • Shard allocation awareness: use allocation awareness and forced awareness attributes for racks or availability zones to improve resiliency.
  • Disk-based shard allocation: configure the cluster.routing.allocation.disk.* deciders so that nodes do not fill up completely; see the sketch after this list.
  • Node roles: isolate roles where appropriate (master-only, data-hot, data-warm, ingest, coordinating) to reduce resource contention.
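
Disk-based allocation and awareness attributes can both be set through the cluster settings API. The sketch below assumes the elasticsearch-py 8.x client, placeholder thresholds, and that each node already defines node.attr.zone in elasticsearch.yml.

```python
# Sketch: tighten the disk-based allocation deciders and enable zone-aware
# shard allocation. Assumes elasticsearch-py 8.x; the thresholds and the
# "zone" attribute name are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://es.example.internal:9200", api_key="...")  # placeholder auth

es.cluster.put_settings(
    persistent={
        # Stop allocating new shards to a node at 80% disk usage, start
        # relocating shards away at 85%, and apply flood-stage protection at 95%.
        "cluster.routing.allocation.disk.watermark.low": "80%",
        "cluster.routing.allocation.disk.watermark.high": "85%",
        "cluster.routing.allocation.disk.watermark.flood_stage": "95%",
        # Spread copies of each shard across the custom "zone" node attribute.
        "cluster.routing.allocation.awareness.attributes": "zone",
    }
)
```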

Monitoring, logging and alerting

  • Collect metrics (JVM, OS, disk, network, index/search metrics) and send them to a monitoring cluster or an APM/metrics system.
  • Enable slow logs (index.search.slowlog and index.indexing.slowlog) to capture slow queries and slow indexing operations; see the sketch after this list.
  • Monitor GC pauses and heap usage; long GC pauses can cause node instability.
  • Alert on cluster health changes (red/yellow), node restarts, high disk usage, and high queue rejection rates.
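
Slow logs are per-index settings; a minimal sketch with the elasticsearch-py 8.x client follows, with placeholder thresholds that should be tuned to your own latency targets.

```python
# Sketch: enable search and indexing slow logs on matching indices.
# Assumes elasticsearch-py 8.x; the index pattern and thresholds are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://es.example.internal:9200", api_key="...")  # placeholder auth

es.indices.put_settings(
    index="logs-*",
    settings={
        "index.search.slowlog.threshold.query.warn": "5s",
        "index.search.slowlog.threshold.query.info": "1s",
        "index.search.slowlog.threshold.fetch.warn": "1s",
        "index.indexing.slowlog.threshold.index.warn": "5s",
        "index.indexing.slowlog.threshold.index.info": "1s",
    },
)
```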

Maintenance and troubleshooting

  • Test restores from snapshots periodically and document the restore steps and timelines.
  • When shards are relocating for long periods, inspect network, disk I/O, and indexing pressure; consider throttling recovery traffic (for example with indices.recovery.max_bytes_per_sec) if needed. A read-only triage sketch follows this list.
  • Keep an eye on file descriptors and ensure ulimits meet bootstrap check recommendations.
  • For memory pressure issues, inspect large fielddata usage, circuit breaker events, and excessive aggregations.
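
A read-only triage script can shorten the first minutes of an incident. The sketch below assumes the elasticsearch-py 8.x client and makes no changes to the cluster; the endpoint and credentials are placeholders.

```python
# Sketch: first-pass cluster triage (read-only). Assumes elasticsearch-py 8.x;
# the endpoint and credentials are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://es.example.internal:9200", api_key="...")  # placeholder auth

# Overall health: status, unassigned shards, pending cluster tasks.
health = es.cluster.health()
print(health["status"], health["unassigned_shards"], health["number_of_pending_tasks"])

# If shards are unassigned, ask the cluster why (the API errors if none are).
if health["unassigned_shards"] > 0:
    explain = es.cluster.allocation_explain()
    print(explain["index"], explain.get("unassigned_info", {}).get("reason"))

# Thread pool rejections usually point at write or search pressure.
for row in es.cat.thread_pool(format="json", h="node_name,name,active,queue,rejected"):
    if row["name"] in ("write", "search") and int(row["rejected"]) > 0:
        print(row)
```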

Kubernetes and containerized deployments

  • When running on Kubernetes, use the official Elasticsearch Helm charts or an operator such as ECK, which handle node lifecycle and settings.
  • Ensure persistent volumes are backed by fast storage and have proper access modes.
  • Configure pod anti-affinity so that Elasticsearch nodes are not co-located on the same Kubernetes node unless that is intended.

Suggested defaults and examples

  • Default JVM heap: the smaller of half the physical RAM and roughly 31GB (staying below the ~32GB compressed object pointer cutoff)
  • Avoid disk usage > 70–80% per node
  • Start with replicas=1 for HA and increase read throughput by adding replica copies where needed
  • For time-series indices, use ILM to automate rollover and movement across tiers, as sketched below
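
For the ILM bullet above, a minimal rollover-and-delete policy might look like the sketch below; it assumes the elasticsearch-py 8.x client, and the policy name, phase thresholds, and retention period are placeholders.

```python
# Sketch: an ILM policy that rolls over hot indices, force-merges them in the
# warm phase, and deletes them after 30 days. Assumes elasticsearch-py 8.x;
# all values are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://es.example.internal:9200", api_key="...")  # placeholder auth

es.ilm.put_lifecycle(
    name="logs-30d",
    policy={
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "1d"}
                }
            },
            "warm": {
                "min_age": "2d",
                "actions": {"forcemerge": {"max_num_segments": 1}},
            },
            "delete": {
                "min_age": "30d",
                "actions": {"delete": {}},
            },
        }
    },
)
```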