General Guidelines
This document collects practical operational and engineering guidelines for Elasticsearch beyond basic sizing: security, backups and recovery, index and mapping best practices, performance tuning, cluster operations, monitoring, and common troubleshooting steps.
Security
- Use TLS for all node-to-node and client-to-node communication. Protect transport and HTTP layers.
- Enable authentication and role-based access control (RBAC). Avoid using the default built-in superuser for application access.
- Apply the principle of least privilege: create roles that only allow the actions necessary for a user/service (see the sketch after this list).
- Rotate certificates and credentials regularly and automate rotation where possible.
- Audit access and actions using Elasticsearch auditing features or an external audit pipeline.
- Keep Elasticsearch and plugin versions up to date to receive security fixes.
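As a minimal sketch of the least-privilege point above, the following creates a read-only role via the security API. The endpoint URL, credentials, CA path, role name, and index pattern are placeholders; adjust them to your deployment.

```python
import requests

ES = "https://localhost:9200"                    # placeholder cluster endpoint
ADMIN_AUTH = ("admin_user", "admin_password")    # placeholder admin credentials
CA_CERT = "ca.crt"                               # CA bundle for TLS verification (placeholder path)

# A role that can only read (and view metadata for) the application's indices.
role = {
    "cluster": [],                               # no cluster-level privileges
    "indices": [
        {
            "names": ["app-logs-*"],             # hypothetical index pattern
            "privileges": ["read", "view_index_metadata"],
        }
    ],
}

resp = requests.put(
    f"{ES}/_security/role/app_logs_reader",
    json=role,
    auth=ADMIN_AUTH,
    verify=CA_CERT,
)
resp.raise_for_status()
```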
Backups, snapshots, and disaster recovery
- Use snapshots to an external repository (S3, GCS, NFS) for backups. Snapshots are incremental and are best used alongside a tested restore plan.
- Keep periodic snapshot schedules and test restores regularly — a snapshot is only useful if it can be restored.
- Consider Snapshot Lifecycle Management (SLM) to automate snapshots and retention; an example follows this list.
- For multi-cluster DR, consider cross-cluster replication (CCR) for near-real-time replication of critical indices.
- Ensure snapshot repositories have sufficient throughput and access controls.
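Below is a sketch of registering an S3 snapshot repository and an SLM policy that snapshots nightly and applies retention. The bucket, repository name, policy id, schedule, and retention values are illustrative, not recommendations.

```python
import requests

ES = "https://localhost:9200"                  # placeholder endpoint
AUTH = ("elastic", "changeme")                 # placeholder credentials
CA_CERT = "ca.crt"                             # placeholder CA bundle

# 1. Register an S3 repository (requires S3 repository support on every node).
repo = {"type": "s3", "settings": {"bucket": "my-es-snapshots"}}   # hypothetical bucket
requests.put(f"{ES}/_snapshot/s3_backups", json=repo,
             auth=AUTH, verify=CA_CERT).raise_for_status()

# 2. An SLM policy: snapshot all indices nightly, keep 5-50 snapshots for up to 30 days.
policy = {
    "schedule": "0 30 1 * * ?",                # cron: every day at 01:30
    "name": "<nightly-{now/d}>",
    "repository": "s3_backups",
    "config": {"indices": ["*"], "include_global_state": False},
    "retention": {"expire_after": "30d", "min_count": 5, "max_count": 50},
}
requests.put(f"{ES}/_slm/policy/nightly-snapshots", json=policy,
             auth=AUTH, verify=CA_CERT).raise_for_status()
```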
Index design and mappings
- Design mappings explicitly. Avoid relying on dynamic mappings in production: uncontrolled dynamic field creation can cause mapping explosion and mapping conflicts.
- Use index templates (and composable templates) to ensure consistent mappings, settings, and aliases for time-based indices; a sketch follows this list.
- Choose appropriate field types and analyzers. Avoid using `text` with default analyzers for fields that will be used for exact-match filters; use `keyword` or a `keyword` subfield instead.
- Limit the number of fields: many fields increase memory usage (fielddata, doc values) and mapping complexity.
- Use `doc_values` for fields used in aggregations and sorting (enabled by default for most fields). Disable `index` or `doc_values` where not needed.
- Avoid large nested structures when possible; consider denormalization or parent/child for specific use cases.
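To make the mapping guidance concrete, here is a sketch of a composable index template with explicit (`strict`) mappings: a `text` field with a `keyword` subfield for exact matches, and `index`/`doc_values` disabled where they are not needed. The index pattern, field names, and shard counts are assumptions for illustration.

```python
import requests

ES = "https://localhost:9200"                    # placeholder endpoint
AUTH = ("elastic", "changeme")                   # placeholder credentials
CA_CERT = "ca.crt"

template = {
    "index_patterns": ["app-logs-*"],            # hypothetical time-based indices
    "template": {
        "settings": {"number_of_shards": 1, "number_of_replicas": 1},
        "mappings": {
            "dynamic": "strict",                 # reject unmapped fields instead of guessing
            "properties": {
                "@timestamp": {"type": "date"},
                "message": {"type": "text"},     # full-text search only
                "service": {"type": "keyword"},  # exact-match filters and aggregations
                "host": {
                    "type": "text",
                    "fields": {"raw": {"type": "keyword"}},  # keyword subfield for filters/sorting
                },
                "payload": {
                    "type": "keyword",
                    "index": False,              # not searchable ...
                    "doc_values": False,         # ... and not aggregatable/sortable
                },
            },
        },
    },
}

requests.put(f"{ES}/_index_template/app-logs", json=template,
             auth=AUTH, verify=CA_CERT).raise_for_status()
```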
Performance tuning and indexing best practices
- Bulk API: use the bulk API for indexing to minimize overhead. Tune bulk sizes to balance throughput and memory (typical payloads vary; test 5–50MB as a starting point).
- Refresh interval: increase `refresh_interval` during heavy bulk indexing (set it to `-1` to disable refreshes, then restore it afterwards) to reduce refresh overhead; see the sketch after this list.
- Translog durability: for very high ingest rates, consider the `index.translog.durability` and `index.translog.sync_interval` settings carefully; be aware of the durability trade-offs.
- Merge and segment tuning: monitor merges and adjust the `index.merge.policy` settings only when you have measured merge-related performance issues; merges are complex and risky to tune without measurement.
- Request and query cache: use the request cache for repetitive, cacheable queries; be mindful of cache sizes.
- Thread pools and queue sizes: monitor thread pools (search, write, get) and tune queues only if you encounter rejections, after analyzing causes.
- Circuit breakers: do not disable circuit breakers. They protect the cluster from out-of-memory conditions.
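A sketch of a bulk-indexing loop that disables `refresh_interval` for the duration of the load and restores it afterwards. The index name, document shape, and batch size are placeholders; measure before settling on a bulk size.

```python
import json
import requests

ES = "https://localhost:9200"            # placeholder endpoint
AUTH = ("elastic", "changeme")           # placeholder credentials
CA_CERT = "ca.crt"
INDEX = "app-logs-2024.01.01"            # hypothetical index name

def set_refresh(interval):
    """Set index.refresh_interval ('-1' disables refreshes)."""
    requests.put(f"{ES}/{INDEX}/_settings",
                 json={"index": {"refresh_interval": interval}},
                 auth=AUTH, verify=CA_CERT).raise_for_status()

def bulk_index(docs):
    """Send one _bulk request: an 'index' action line followed by each document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc))
    body = "\n".join(lines) + "\n"       # bulk bodies must end with a newline
    resp = requests.post(f"{ES}/{INDEX}/_bulk", data=body,
                         headers={"Content-Type": "application/x-ndjson"},
                         auth=AUTH, verify=CA_CERT)
    resp.raise_for_status()
    if resp.json().get("errors"):
        raise RuntimeError("some bulk items failed; inspect the per-item responses")

set_refresh("-1")                        # pause refreshes during the heavy load
try:
    docs = [{"message": f"event {i}"} for i in range(10_000)]
    for start in range(0, len(docs), 1_000):   # illustrative batch size; tune by measurement
        bulk_index(docs[start:start + 1_000])
finally:
    set_refresh("1s")                    # restore refreshes (default is 1s)
```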
Cluster operations and lifecycle
- Rolling upgrades: follow the official rolling upgrade procedures to upgrade nodes with minimal downtime.
- Bootstrap checks: when running in production mode, ensure bootstrap checks pass (heap, file descriptors, mmap settings, etc.).
- Shard allocation awareness: use allocation awareness and forced awareness attributes for racks/availability zones to improve resiliency.
- Disk-based shard allocation: configure the `cluster.routing.allocation.disk.*` settings (disk watermarks) so nodes do not fill up completely; an example follows this list.
- Node roles: isolate roles where appropriate (master-only, data-hot, data-warm, ingest, coordinating) to reduce resource contention.
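Sketched below, per the disk-allocation and awareness bullets above, is a persistent cluster-settings update that sets the disk watermarks and an awareness attribute. The thresholds and the `zone` attribute are examples, not recommendations; nodes must be started with a matching `node.attr.zone` value.

```python
import requests

ES = "https://localhost:9200"            # placeholder endpoint
AUTH = ("elastic", "changeme")           # placeholder credentials
CA_CERT = "ca.crt"

settings = {
    "persistent": {
        # Stop allocating new shards to a node at 80% disk, move shards off at 85%,
        # and enforce the flood-stage protection at 95%. Values here are illustrative.
        "cluster.routing.allocation.disk.watermark.low": "80%",
        "cluster.routing.allocation.disk.watermark.high": "85%",
        "cluster.routing.allocation.disk.watermark.flood_stage": "95%",
        # Spread shard copies across the node attribute 'zone'
        # (each node must set node.attr.zone in elasticsearch.yml).
        "cluster.routing.allocation.awareness.attributes": "zone",
    }
}

requests.put(f"{ES}/_cluster/settings", json=settings,
             auth=AUTH, verify=CA_CERT).raise_for_status()
```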
Monitoring, logging and alerting
- Collect metrics (JVM, OS, disk, network, index/search metrics) and send them to a monitoring cluster or an APM/metrics system.
- Enable slow logs (the `index.search.slowlog` and `index.indexing.slowlog` settings) to capture slow queries and slow indexing operations; see the example after this list.
- Monitor GC pauses and heap usage; long GC pauses can cause node instability.
- Alert on cluster health changes (red/yellow), node restarts, high disk usage, and high queue rejection rates.
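The example below applies slow-log thresholds to an index and then reads per-node rejection counts from the `_cat/thread_pool` API; the index name and thresholds are placeholders.

```python
import requests

ES = "https://localhost:9200"            # placeholder endpoint
AUTH = ("elastic", "changeme")           # placeholder credentials
CA_CERT = "ca.crt"
INDEX = "app-logs-*"                     # hypothetical index pattern

# Dynamic slow-log thresholds (values are illustrative; tune per workload).
slowlog = {
    "index.search.slowlog.threshold.query.warn": "5s",
    "index.search.slowlog.threshold.query.info": "1s",
    "index.indexing.slowlog.threshold.index.warn": "5s",
}
requests.put(f"{ES}/{INDEX}/_settings", json=slowlog,
             auth=AUTH, verify=CA_CERT).raise_for_status()

# Check for search/write queue rejections, a common signal of overload.
resp = requests.get(
    f"{ES}/_cat/thread_pool/search,write",
    params={"format": "json", "h": "node_name,name,active,queue,rejected"},
    auth=AUTH, verify=CA_CERT,
)
resp.raise_for_status()
for row in resp.json():
    if int(row["rejected"]) > 0:
        print(f"{row['node_name']} {row['name']}: {row['rejected']} rejections")
```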
Maintenance and troubleshooting
- Test restores from snapshots periodically and document the restore steps and timelines; a restore sketch follows this list.
- When shards relocate for long periods, inspect network, disk I/O, and indexing pressure; consider throttling recovery and rebalancing if needed.
- Keep an eye on file descriptors and ensure ulimits meet bootstrap check recommendations.
- For memory pressure issues, inspect large fielddata usage, circuit breaker events, and excessive aggregations.
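As a sketch of the restore-testing and memory-pressure points above: restore a snapshot into renamed indices (so production indices are untouched) and read circuit-breaker statistics from the nodes-stats API. The repository, snapshot, and index names are placeholders.

```python
import requests

ES = "https://localhost:9200"            # placeholder endpoint
AUTH = ("elastic", "changeme")           # placeholder credentials
CA_CERT = "ca.crt"

# Restore one index from a snapshot under a 'restored-' prefix for verification.
restore = {
    "indices": "app-logs-2024.01.01",    # hypothetical index in the snapshot
    "rename_pattern": "(.+)",
    "rename_replacement": "restored-$1",
    "include_global_state": False,
}
requests.post(f"{ES}/_snapshot/s3_backups/nightly-2024.01.02/_restore",
              json=restore, auth=AUTH, verify=CA_CERT).raise_for_status()

# Inspect circuit-breaker trip counts as part of a memory-pressure investigation.
resp = requests.get(f"{ES}/_nodes/stats/breaker", auth=AUTH, verify=CA_CERT)
resp.raise_for_status()
for node_id, node in resp.json()["nodes"].items():
    for name, breaker in node["breakers"].items():
        if breaker["tripped"] > 0:
            print(f"{node['name']}: breaker '{name}' tripped {breaker['tripped']} times")
```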
Kubernetes and containerized deployments
- When running on Kubernetes, use the official Elastic operator (ECK) or Helm charts, which handle lifecycle and settings for you.
- Ensure persistent volumes are backed by fast storage and have proper access modes.
- Configure pod anti-affinity so that Elasticsearch pods do not co-locate on the same Kubernetes node unless intended.
Suggested defaults and examples
- Default JVM heap: `min(physical RAM / 2, ~31GB)`
- Avoid disk usage above 70–80% per node
- Start with replicas=1 for HA and increase read throughput by adding replica copies where needed
- For time-series indices, use ILM to automate rollover and movement across tiers
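For the time-series point above, here is a sketch of an ILM policy that rolls over in the hot phase, moves shards to warm nodes, and deletes old indices. The sizes and ages are illustrative, and the warm-phase allocation assumes nodes tagged with a matching attribute.

```python
import requests

ES = "https://localhost:9200"            # placeholder endpoint
AUTH = ("elastic", "changeme")           # placeholder credentials
CA_CERT = "ca.crt"

policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Roll over to a new index at 50GB or 7 days, whichever comes first.
                    "rollover": {"max_size": "50gb", "max_age": "7d"}
                }
            },
            "warm": {
                "min_age": "7d",
                "actions": {
                    "forcemerge": {"max_num_segments": 1},
                    # Assumes warm nodes set node.attr.data: warm in elasticsearch.yml.
                    "allocate": {"require": {"data": "warm"}},
                },
            },
            "delete": {
                "min_age": "30d",
                "actions": {"delete": {}},
            },
        }
    }
}

requests.put(f"{ES}/_ilm/policy/app-logs-policy", json=policy,
             auth=AUTH, verify=CA_CERT).raise_for_status()
```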