Elasticsearch’s DevOps head start. Part 1. Operating system.
This post discusses the choices a DevOps engineer faces in a fresh project that can take advantage of the newest version of Elasticsearch (7.14), and how to grow such a cluster from a PoC up to a fully fledged production cluster.
Preface
As DevOps, our task is to provision infrastructure and ensure the service is up and running, constantly available and stable. There are manuals and official “getting started” guides you should be familiar with. Here, however, we will focus on aspects that are beyond the scope of the official documents.

Before we dive into technical considerations, let me give you a baseline idea of what Elasticsearch is from a DevOps perspective. On a very simplistic level, it’s just a program that serves JSON to applications via an HTTP API. Although often called a database, the proper technical term is a datastore. The differences run deep, but for now I’ll point out just a handful: a database stores all the information you feed it, and it can do more than that, including creating Views or executing Stored Procedures. While databases aggregate and modify data, a datastore is used for near real-time search. A database aims to be as complete as possible, while a datastore ensures data persistence for a limited period, so old entries get deleted. More often than not you will want to weigh what is actually useful in the long term against what can be removed. This brings the benefit of cost savings and search speed.
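To see that “JSON over HTTP” in practice, you can query any node directly. A minimal sketch, assuming a local node on the default port 9200 with security disabled:

curl -s 'http://localhost:9200/'                        # node, cluster and version info as JSON
curl -s 'http://localhost:9200/_cluster/health?pretty'  # overall cluster status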
We’ll kick things off with a look at how to avoid early pitfalls.
Server spec
Our cluster will grow, but it can start out very modest – a machine with 2 CPUs and 8 GB RAM; going lower than this will hinder the JVM. There won’t be much load early on, so a single server will do all the work. Later on we will specialize each machine to a specific role. In production, even in the early stages, there should be a minimum of three masters in different zones and plenty of data nodes.
Masters can be fairly small machines – 1 CPU, 2 GB RAM (a role configuration sketch follows the list below). For data nodes, the rule of thumb is:
- at least 2 CPUs
- 4 GB RAM per core, but no more than 64 GB
- storage of at most 16 times the RAM capacity; it is better to have more nodes. With a maximum of 64 GB * 16 = 1 TB per node, there is no need for the inode64 mount option.
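Once machines start to specialize, roles are declared in elasticsearch.yml. A minimal sketch – the node names are made up, and the data-node variant is shown commented out:

# /etc/elasticsearch/elasticsearch.yml on a dedicated master
node.name: es-master-1
node.roles: [ master ]

# ...and on a dedicated data node:
# node.name: es-data-1
# node.roles: [ data ]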
The kernel impacts performance a lot, both through improvements and occasional bugs, so consider kernel upgrades carefully. A common approach is to stay one release behind the latest kernel (latest minus one) if possible. You can run the newest one, but test it first in lower environments. One more thing: you will hear that “database servers shouldn’t be run on VMs or in the cloud” – by the time this becomes a real problem, there should be a revenue stream to address it.
The operating system is the foundation of everything running on it. At the same time, a database is not merely a “program” – it augments the system, so hosts will require the default settings to be customized. To have one less thing to worry about, use the .deb / .rpm packages, which apply most of these settings automatically.
Changes you’ll need to apply to the system yourself:
echo 'vm.swappiness=1' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_retries2=5' >> /etc/sysctl.conf
sysctl -p
Both changes will be discussed in upcoming parts.
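A quick way to confirm that both our changes and the package-applied ones are live (vm.max_map_count is raised by the .deb / .rpm package itself):

sysctl vm.swappiness net.ipv4.tcp_retries2 vm.max_map_count
# expected: vm.swappiness = 1, net.ipv4.tcp_retries2 = 5, vm.max_map_count = 262144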
CPU
The CPU model becomes important at some point. There are options to fine-tune multi-tiered compilation for performance. However, since we’ll be starting with a rented VM just to observe our cluster’s growth, don’t worry about it yet.
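If you ever do experiment with this, Elasticsearch picks up extra JVM flags from files under /etc/elasticsearch/jvm.options.d/. The sketch below is illustrative only – these are standard HotSpot options, but whether they help depends entirely on your CPU and workload, so measure before adopting any of them:

# /etc/elasticsearch/jvm.options.d/compiler.options
#-XX:-TieredCompilation      # skip the interpreter/C1 tiers and use only the optimizing C2 compiler
#-XX:CICompilerCount=4       # cap the number of JIT compiler threads on small machines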
When your organization starts to consider moving to physical machines, you will want to start looking at compiling Elasticsearch on your own.
RAM
As long as we operate on a server with at most 64 GB RAM, we can ignore the hugepages topic. There are diminishing returns on nodes with massive RAM and storage.
Elasticsearch uses RAM not only for the JVM heap, which shouldn’t take more than 50% of available memory, and preferably less than 8 GB (more about it in part 2). Whatever memory remains serves the memory-mapped filesystem cache – files will be kept in memory and read directly from it for speed.
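The heap size is set in a file under /etc/elasticsearch/jvm.options.d/. A minimal sketch for the 8 GB machine assumed above – half for the heap, the rest left to the page cache:

# /etc/elasticsearch/jvm.options.d/heap.options
-Xms4g
-Xmx4g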
Networking
Once we have more than one node, and if the hosting provider allows it, we can activate jumbo frames:
ifconfig eth0 mtu 9000
The benefit of the above materializes when shards need to be rebalanced across nodes to avoid hitting the watermark thresholds.
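A quick way to verify that jumbo frames actually work end to end – the interface name eth0 and the peer address 10.0.0.2 are assumptions, adjust them to your network:

ip link set dev eth0 mtu 9000       # modern equivalent of the ifconfig call above
ping -M do -s 8972 -c 3 10.0.0.2    # 9000 minus 28 bytes of IP/ICMP headers; must pass without fragmentation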
Data disk
Putting the database on the system disk is just asking for all kinds of trouble; we should always attach a dedicated disk. The most important metric of such a disk is IOPS. Disk usage will fluctuate, growing as data arrives and blooming even more during segment merges, only to shrink back once the merge is complete. At 85% of data disk capacity (the default low watermark) the node stops receiving new shards, at 90% shards are relocated away, and at 95% (the flood-stage watermark) indices on that node go read-only until space is freed and shards get rebalanced between nodes. Keep at least 30% of the disk free as a reserve. Elasticsearch ensures reliability through replica shards, so RAID 0 is perfectly acceptable.
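Disk pressure per node is easy to keep an eye on from the API itself. A minimal sketch, assuming Elasticsearch listens on localhost:9200 without authentication:

curl -s 'http://localhost:9200/_cat/allocation?v'       # disk used/available and shard count per node
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,disk.used_percent,heap.percent,cpu'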
The preferred I/O scheduler is either deadline or mq-deadline. Bear in mind that on Azure, device names are not stable between restarts, but there is no downside to changing the I/O scheduler for every device:
# /etc/systemd/system/io-scheduler.service
[Unit]
Description=Set I/O scheduler
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/bin/bash -c 'for s in /sys/block/sd*/queue/scheduler; do grep -qw mq-deadline "$s" && echo mq-deadline > "$s" || echo deadline > "$s"; done'
TimeoutSec=0
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
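Reload systemd and enable the unit so it is applied now and after every boot; the last command shows the active scheduler in square brackets:

sudo systemctl daemon-reload
sudo systemctl enable --now io-scheduler.service
cat /sys/block/sd*/queue/scheduler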
Filesystem
Elasticsearch allows the use of multiple data paths but, well, it doesn’t handle them well. Shards won’t be balanced between data.path entries on the node, and if a single data path crosses the flood-stage watermark, the whole node is marked read-only. In AWS, EBS allows you to resize a disk without detaching it; just remember to grow the filesystem afterwards. On premise, use LVM and simply point data.path at the mount point. Use LVM in Azure as well, unless machine downtime is acceptable.

If you end up with more than 4 disks attached, it’s smart to replace them with a single one via snapshot and restore. A backup does not contain data already marked for deletion, so the restored data will be faster to search through. A single disk will have more IOPS too. And to top it all off, segments can be merged while the old infrastructure is still operating; the new cluster is ready once the data is restored to a point in time. If you are worried about wasting hours and hours, remember that we are still operating on a small volume of data, and the old cluster remains available until the new one is ready.

The end game is to mount a single 1 TB data disk, of which we will use about 70%. Remember to adjust the CPU count as RAM and storage grow. Since with capacity we also gain IOPS, we change the I/O scheduler to noop – both for lower CPU load and an even speedier response (which is now beneficial since the IOPS are maxed out).
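Growing the filesystem after an online EBS (or Azure disk) resize takes only a few commands. A minimal sketch – the device /dev/sdc and the vg_1/lg_1 volume group and logical volume are assumptions matching the LVM script further down, with XFS mounted at /var/lib/elasticsearch:

sudo growpart /dev/sdc 1                      # grow partition 1 to fill the resized disk (growpart ships with cloud-guest-utils / cloud-utils-growpart)
sudo pvresize /dev/sdc1                       # let LVM see the new physical size
sudo lvextend -l +100%FREE /dev/vg_1/lg_1     # grow the logical volume
sudo xfs_growfs /var/lib/elasticsearch        # grow XFS online; use resize2fs for ext4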
Final remarks:
- LVM does not impact performance much
- XFS and EXT4 are both recommended
- mount with noatime
Since LVM commands are not something people keep in L1 cache, here is a script:
#!/usr/bin/env bash
# by Karol Bartnik, use and share freely.
mount_point=$1

sudo snap install jq

devices_count=$(/bin/lsblk -o NAME -n --json | jq '.blockdevices | length')
for (( device_index=0; device_index<devices_count; device_index++ )); do
  device_name=$(/bin/lsblk -o NAME -n --json | jq -r ".blockdevices[$device_index].name")
  if [[ $device_name =~ "sd" ]]; then
    devices_children_count=$(/bin/lsblk -o NAME -n --json | jq ".blockdevices[$device_index].children | length")
    if [[ $devices_children_count == 0 ]]; then
      dev_name=$device_name        # raw device, e.g. sdc
      disk_name="${dev_name}1"     # first partition on it, e.g. sdc1
      # create an LVM partition, volume group and logical volume, format it with XFS and mount it
      printf "n\np\n\n\n\nt\n8e\nw\n" | sudo fdisk /dev/$dev_name \
        && sudo pvcreate /dev/$disk_name \
        && sudo vgcreate -s 32M vg_1 /dev/$disk_name \
        && sudo lvcreate -l +100%FREE -n lg_1 vg_1 \
        && sudo mkfs.xfs /dev/vg_1/lg_1 \
        && sudo mkdir -p $mount_point \
        && echo "/dev/mapper/vg_1-lg_1 $mount_point xfs defaults,discard,noatime 0 2" | sudo tee -a /etc/fstab \
        && sudo mount -a
    fi
  fi
done
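Save it under any name, e.g. add_data_disk.sh (the filename is arbitrary), and run it with the desired mount point as the only argument:

sudo bash add_data_disk.sh /var/lib/elasticsearch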
Or, as an Ansible playbook for the filesystem:
# add_filesystem.yml
---
- hosts: elasticsearch_vm
  become: true
  become_method: sudo
  gather_facts: true
  roles:
    - add_filesystem

# add_filesystem/vars/fs.yml
---
fs_type: xfs
mount_point: /var/lib/elasticsearch
owner: elasticsearch

# add_filesystem/tasks/main.yml
---
# by Karol Bartnik, use and share freely.
- name: vars inclusion
  include_vars:
    dir: vars

- name: get device name
  ansible.builtin.set_fact:
    disk_name: "{{ item }}"
    dev_disk: "/dev/{{ item }}"
    partition: "/dev/{{ item }}1"
  when: ansible_facts.devices[item].partitions == {}
  with_items: "{{ ansible_facts.devices.keys() | select('match', '^sd(.*)$') | list }}"

- name: prepare device if any
  include_tasks: prepare_device.yml
  when: disk_name is defined

# add_filesystem/tasks/prepare_device.yml
---
- name: create partition
  ansible.builtin.parted:
    device: "{{ dev_disk }}"
    number: 1
    flags: [ lvm ]
    state: present

- name: create volume group
  ansible.builtin.lvg:
    vg: vg1
    pvs: "{{ partition }}"
    pesize: "32"

- name: resize volume group
  ansible.builtin.lvg:
    vg: vg1
    pvs: "{{ partition }}"

- name: create logical volume
  community.general.lvol:
    vg: vg1
    lv: lv1
    size: +100%FREE

- name: create filesystem
  ansible.builtin.filesystem:
    fstype: "{{ fs_type }}"
    dev: /dev/mapper/vg1-lv1

- name: mount with mapper
  mount:
    path: "{{ mount_point }}"
    src: /dev/mapper/vg1-lv1
    fstype: "{{ fs_type }}"
    opts: defaults,discard,noatime
    state: mounted

- name: chown mount_point {{ mount_point }}
  file:
    recurse: true
    path: "{{ mount_point }}"
    owner: "{{ owner }}"
    group: "{{ owner }}"
    state: directory
    mode: '0755'
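Assuming an inventory file that defines the elasticsearch_vm group (the inventory name is yours to pick), run it with:

ansible-playbook -i inventory add_filesystem.yml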
Monitoring
If possible, add monitoring for CPU, I/O, networking (throughput), and memory usage.
Filesystem usage calls for particular attention, though how to go about that is beyond the scope of this article. Later on, you should also observe the JVM; we’ll look at which aspects in particular in the second part of this series.
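Until proper monitoring is in place, the stats API already exposes most of the signals mentioned above. A minimal sketch, again assuming a local node on port 9200 without authentication:

curl -s 'http://localhost:9200/_nodes/stats/os,jvm,fs?pretty'   # per-node CPU, memory, heap and disk usage
df -h /var/lib/elasticsearch                                    # the data disk itself, assuming the mount point used earlier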