Elasticsearch Sizing Guidelines
This document gives practical guidance for sizing Elasticsearch clusters (nodes, memory, CPU, storage, and shards) and points to the official Elasticsearch documentation for deeper reference. Use these guidelines as starting points — actual sizing must be validated with realistic load tests for your data, query patterns, and retention policies.
Key principles
- Right-size heap: give Elasticsearch enough heap to operate, but avoid very large JVM heaps. Use native OS memory for filesystem cache and long-lived buffers.
- Keep shards at reasonable sizes: too many small shards add overhead, while very large shards are harder to recover and rebalance.
- Separate concerns: consider hot/warm or cold tiers for different retention/IOPS needs.
- Measure and iterate: baseline CPU, IOPS, and indexing/query throughput with representative workloads and tune from there.
Node sizing basics
- RAM: start with 32–64GB of system RAM for data nodes in small to medium clusters; larger clusters commonly use 64–256GB depending on workload.
- JVM heap: set the JVM heap to 50% of physical RAM, but no more than ~31GB (to keep compressed ordinary object pointers). Example: with 64GB RAM, set the heap to 31GB (or 30g), leaving the rest for OS cache and native memory.
  - Rule: heap = min(physical RAM / 2, 31GB) (see the sketch after this list)
- CPU: choose CPUs based on query and indexing load. Many cores help with heavy concurrent queries. A baseline data node often uses 4–16 vCPUs; scale up for high-throughput indexing or heavy aggregations.
- Disk: prefer SSDs for data nodes. Choose capacity based on retention and expected indexing rate, plus replicas and snapshots. Leave headroom: avoid filling disks above ~70–80% to allow shard reallocation.
- Network: ensure a low-latency, high-throughput network between nodes (1Gbps minimum, 10Gbps for larger clusters or heavy cross-node traffic).
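A minimal sketch of the heap rule above, in Python; the function name and the example RAM sizes are illustrative, and the 50% / ~31GB figures are the guideline values from this guide rather than an official Elastic formula.

```python
def recommended_heap_gb(physical_ram_gb: float) -> float:
    """Heap sizing rule from this guide: half of physical RAM,
    capped at ~31GB to keep compressed object pointers (oops)."""
    return min(physical_ram_gb / 2, 31.0)

# Example: a 64GB data node -> 31GB heap, leaving ~33GB for the
# OS filesystem cache and native memory.
for ram in (16, 32, 64, 128):
    print(f"{ram}GB RAM -> heap {recommended_heap_gb(ram):g}GB")
```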
Shard sizing and counts
- Shard size: aim for shard sizes that balance manageability and performance. A common guideline is 20–50GB per shard for many workloads; adjust based on your query patterns and recovery targets. Very large shards (>100GB) make recovery and relocation slow.
- Number of shards per node: keep the total number of shards moderate (thousands of shards per cluster may be fine, but avoid excessive shard counts per node), since each shard carries memory and file-descriptor overhead. A quick way to estimate primary shard counts is sketched after this list.
- Primary vs replicas: replicas increase read throughput and provide redundancy. Plan replica count so cluster can tolerate node failures while still serving queries.
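As a rough planning aid for the shard guidance above, the sketch below estimates a primary shard count from expected index size; the helper name and the 40GB default target are assumptions chosen from the 20–50GB band and should be validated against your own recovery and query targets.

```python
import math

def estimate_primary_shards(index_size_gb: float, target_shard_gb: float = 40.0) -> int:
    """Estimate a primary shard count so each shard lands near the
    target size (this guide suggests roughly 20-50GB per shard)."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))

# Example: a 300GB index with a 40GB target -> 8 primary shards.
print(estimate_primary_shards(300))       # 8
print(estimate_primary_shards(300, 50))   # 6
```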
Index design and lifecycle
- Index-per-time-bucket: if using time-series data, create indices aligned to retention and query windows (daily, weekly) to make lifecycle management easier.
- ILM (Index Lifecycle Management): use ILM to move indices between tiers (hot -> warm -> cold) and to delete or freeze old indices automatically; a minimal policy sketch follows this list.
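A minimal ILM policy sketch using the `_ilm/policy` REST endpoint via the Python requests library, assuming a local unsecured cluster; the policy name, rollover thresholds, and phase timings are placeholders, not recommendations.

```python
import requests

ES = "http://localhost:9200"  # placeholder cluster address

# Hypothetical policy: roll over hot indices at ~40GB or 1 day,
# force-merge them in the warm phase after 7 days, delete after 30 days.
policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_size": "40gb", "max_age": "1d"}
                }
            },
            "warm": {
                "min_age": "7d",
                "actions": {"forcemerge": {"max_num_segments": 1}}
            },
            "delete": {
                "min_age": "30d",
                "actions": {"delete": {}}
            }
        }
    }
}

resp = requests.put(f"{ES}/_ilm/policy/logs-30d", json=policy)
resp.raise_for_status()
print(resp.json())
```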
Hot–Warm (and cold) architecture
- Hot nodes: fast CPUs, more memory, and fast NVMe/SSDs — handle indexing and low-latency queries.
- Warm nodes: less CPU, larger, cheaper storage — for querying older data with lower performance requirements.
- Cold/Frozen: for long-term retention, reduce resource usage and accept slower queries.
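To complement ILM, new indices can be steered onto hot nodes with a tier preference in an index template. The sketch below assumes data-tier node roles (data_hot/data_warm, available in recent Elasticsearch versions) and reuses the hypothetical logs-30d policy from the earlier sketch; the template name and index pattern are placeholders.

```python
import requests

ES = "http://localhost:9200"  # placeholder cluster address

# Hypothetical template: new logs-* indices prefer hot-tier nodes;
# ILM later moves them toward warm/cold nodes as they age.
template = {
    "index_patterns": ["logs-*"],
    "template": {
        "settings": {
            "index.routing.allocation.include._tier_preference": "data_hot",
            "index.lifecycle.name": "logs-30d"  # hypothetical policy name
        }
    }
}

resp = requests.put(f"{ES}/_index_template/logs-hot-first", json=template)
resp.raise_for_status()
```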
Storage performance and sizing
- IOPS and throughput: base sizing on expected indexing throughput and query IO patterns. Bulk indexing may require sustained write throughput; heavy aggregations drive read IOPS.
- Snapshots: allocate network and storage throughput for snapshots. Snapshots are incremental, but they still consume meaningful bandwidth when retention periods are long; a registration sketch follows this list.
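A minimal snapshot sketch against the `_snapshot` endpoint, assuming a shared-filesystem repository whose path is listed in path.repo on every node; the repository name, path, and snapshot name are placeholders, and managed deployments would typically use an s3/gcs/azure repository type instead.

```python
import requests

ES = "http://localhost:9200"  # placeholder cluster address

# Hypothetical shared-filesystem repository; "fs" requires the location
# to appear in path.repo in elasticsearch.yml on all nodes.
repo = {"type": "fs", "settings": {"location": "/mnt/es_backups"}}
requests.put(f"{ES}/_snapshot/nightly_backups", json=repo).raise_for_status()

# Take a snapshot of all indices; snapshots are incremental, so only new
# segments are copied after the first run.
requests.put(
    f"{ES}/_snapshot/nightly_backups/snap-2024-01-01",
    json={"indices": "*", "include_global_state": True},
).raise_for_status()
```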
Monitoring and metrics to watch
- JVM heap usage and GC pause times
- Node CPU and load
- Disk usage and IOPS
- Search rate, latency, and queue sizes
- Indexing rate, refresh rate, and merge times
- Cluster health and shard allocation status
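A small polling sketch that reads several of these metrics from the cluster health and node stats APIs; the endpoints and fields are standard, but the thresholds in the comments are placeholder assumptions, and a production setup would normally rely on stack monitoring or an external monitoring system rather than an ad-hoc script.

```python
import requests

ES = "http://localhost:9200"  # placeholder cluster address

# Cluster-level view: status, unassigned shards, pending tasks.
health = requests.get(f"{ES}/_cluster/health").json()
print(health["status"], health["unassigned_shards"], health["number_of_pending_tasks"])

# Per-node JVM heap and disk usage from the node stats API.
stats = requests.get(f"{ES}/_nodes/stats/jvm,fs").json()
for node_id, node in stats["nodes"].items():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    fs_total = node["fs"]["total"]
    disk_pct = 100 * (1 - fs_total["available_in_bytes"] / fs_total["total_in_bytes"])
    # Placeholder thresholds: sustained heap > 85% or disk > 75% deserves attention.
    print(f"{node['name']}: heap {heap_pct}%, disk {disk_pct:.0f}%")
```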
Example sizing workflows
- Small test cluster (development or low-volume):
  - 3 data nodes, each: 16–32GB RAM, heap 8–15GB, 4 vCPUs, 1–2 SSDs
  - replicas: 1
- Production moderate cluster (search + indexing):
  - 3–5 hot data nodes, each: 64GB RAM, heap 31GB, 8–16 vCPUs, NVMe SSDs
  - 3–5 warm nodes with larger, cheaper storage and 64–128GB RAM with a smaller heap
  - use ILM to roll indices from hot -> warm
Example calculation (very simplified): expected daily ingestion = 100GB/day raw documents
- Retention: 30 days -> 3TB raw
- With replicas=1 (doubling stored data) and ~20% overhead for indexing and merges -> ~7.2TB of usable capacity required (3TB x 2 x 1.2); see the sketch after this list.
- Choose node count and disk sizes such that per-node disk usage stays below 70% with room for snapshots and rebalancing.
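The same back-of-the-envelope arithmetic as a reusable sketch; the 20% overhead factor and 70% fill ceiling come from this guide, and the 2TB-per-node figure in the example is purely illustrative.

```python
import math

def required_usable_tb(daily_gb: float, retention_days: int,
                       replicas: int = 1, overhead: float = 0.20) -> float:
    """Raw data x (1 + replicas) x (1 + indexing/merge overhead), in TB."""
    return daily_gb * retention_days * (1 + replicas) * (1 + overhead) / 1000

def min_data_nodes(usable_tb: float, disk_per_node_tb: float,
                   max_fill: float = 0.70) -> int:
    """Smallest node count keeping per-node disk usage under the fill ceiling."""
    return math.ceil(usable_tb / (disk_per_node_tb * max_fill))

need = required_usable_tb(daily_gb=100, retention_days=30)   # ~7.2 TB
print(need)                                                  # 7.2
print(min_data_nodes(need, disk_per_node_tb=2.0))            # 6 nodes with 2TB disks each
```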
Operational guidance
- Avoid swapping: configure the OS to avoid swapping Elasticsearch processes and reserve memory for the OS.
- Set bootstrap.memory_lock and lock the JVM heap if appropriate to prevent swapping; a quick way to verify this is sketched after this list.
- Use discovery.seed_hosts and proper discovery settings for cluster formation.
- Regularly run queries that reflect production patterns during capacity planning and scaling tests.
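A quick verification sketch, assuming a local unsecured cluster: the node info API reports whether bootstrap.memory_lock actually took effect on each node; the script itself and the remediation hint in the comment are illustrative.

```python
import requests

ES = "http://localhost:9200"  # placeholder cluster address

# The node info API exposes process.mlockall, which is true when the
# JVM heap was successfully locked in RAM.
info = requests.get(f"{ES}/_nodes/process").json()
for node_id, node in info["nodes"].items():
    locked = node["process"]["mlockall"]
    print(f"{node['name']}: mlockall={locked}")
    if not locked:
        print("  -> memory lock failed; check memlock ulimits / systemd LimitMEMLOCK")
```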
References (official Elasticsearch documentation)
- JVM heap sizing and guidelines
- Important settings and node memory layout
- Scale the cluster and capacity planning
- Size shards (production guidance)
- Index Lifecycle Management (ILM)
- Monitoring Elasticsearch clusters
Further reading and tools
- Elastic's sizing and performance blog posts and the Elasticsearch forums for case studies and examples.
- Use Rally (Elastic's benchmarking tool) to run realistic indexing and search benchmarks.