Cloud NVMes: the blind side of them

Rimantas Ragainis
Apr 15, 2020

Intro

So you know that NVMes are much faster than ordinary SSD disks. We had been using NVMes on AWS for several years when the moment came to migrate part of our IT infrastructure to GCP, tempted by the prospect of saving some money. GCP vs AWS charts do not indicate any significant difference in IO read/write performance, but the pricing does, so we took off. Apparently it's naive to think so (obviously! #1 mistake), and we started noticing red flags along the way. One of our heavily used Percona MySQL 5.7 servers started experiencing a performance lag in its replication process.

replication lag between GCP and AWS

We thought it was just an expected network issue between different DCs (#2 mistake). Apparently the network had nothing to do with it: once the master and replica nodes were migrated to the same DC (GCP), the lag remained. So we started digging deeper into how the NVMes were actually performing and found out it was the disk itself.

For the following tests we were using these VM instance types:
  • GCP: n1-standard-32, 8 x 375 GB NVMe (local-ssd)
  • AWS: i3.4xlarge, 2 x 1.9 TB NVMe
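
For reference, a GCP box like that can be spun up with local SSDs attached over the NVMe interface. A rough sketch (instance name and zone are placeholders; each GCP local SSD is a fixed 375 GB):

# sketch: n1-standard-32 with 8 local SSDs exposed as NVMe devices
gcloud compute instances create nvme-test --zone=europe-west1-b \
    --machine-type=n1-standard-32 \
    --local-ssd=interface=NVME --local-ssd=interface=NVME \
    --local-ssd=interface=NVME --local-ssd=interface=NVME \
    --local-ssd=interface=NVME --local-ssd=interface=NVME \
    --local-ssd=interface=NVME --local-ssd=interface=NVME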

Test case #1

The first test case used the simple fio tool. It's a straightforward way to check disk throughput at a desired block size, but there are a couple of caveats on how to run the test properly: you should clear the page cache before a read step (e.g.: sync; echo 3 > /proc/sys/vm/drop_caches) and delete the existing test file before a write step.
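
In practice the prep between runs boils down to something like this (the path matches the fio command below):

# drop the page cache before a read run
sync; echo 3 > /proc/sys/vm/drop_caches
# and start write runs from a fresh file
rm -f /mysql/test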

For the tests we used both XFS and EXT4 journaling file systems, just to be sure there was no meaningful difference on that front.
The test command goes as follows:

fio --time_based --name=benchmark --size=10G --runtime=30 \
--filename=/mysql/test --ioengine=libaio --randrepeat=0 \
--iodepth=128 --direct=1 --invalidate=1 --verify=0 \
--verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=Xk \
--group_reporting
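
The X in --blocksize=Xk is the parameter we varied between runs, together with the read/write mode. A minimal sketch of such a sweep (the exact set of block sizes below is just an example):

# hypothetical sweep over block sizes and I/O modes
for bs in 4 8 16 32 64 128; do
  for mode in randread randwrite; do
    sync; echo 3 > /proc/sys/vm/drop_caches        # clear page cache before each run
    [ "$mode" = "randwrite" ] && rm -f /mysql/test # fresh file for write runs
    fio --time_based --name=benchmark --size=10G --runtime=30 \
        --filename=/mysql/test --ioengine=libaio --randrepeat=0 \
        --iodepth=128 --direct=1 --invalidate=1 --verify=0 \
        --verify_fatal=0 --numjobs=4 --rw=$mode --blocksize=${bs}k \
        --group_reporting
  done
done
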
fio read results per different block size
fio write results per different block size

At first sight, the results kind of confirm the issue we're facing: slow disks. At least in terms of read operations, GCP NVMe does look slower compared to AWS NVMe.

But there is one more angle to look at this from: the default MySQL InnoDB page size is 16KB. It is recommended to keep the InnoDB page size close to the storage device block size in order to minimize fsync() events (flushing data to disk), and that is exactly what differs between GCP and AWS:

# GCP
# fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 402.7 GB, 402653184000 bytes, 98304000 sectors
Units = sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
# AWS
# fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 1900.0 GB, 1900000000000 bytes, 3710937500 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
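
For completeness, the page size on the MySQL side can be confirmed with a quick query (16384 bytes, i.e. 16KB, is the InnoDB default):

# confirm the configured InnoDB page size
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_page_size';"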

As you can see, neither of them equals 16KB, yet the AWS disk is noticeably faster. Why is that? I found an interesting post about small page sizes and performance by Vadim Tkachenko, which, sadly, only confirms that the same InnoDB page size can perform quite differently on SSDs from different manufacturers.

Of course, it’s just an assumption that aligned disk block size against MySQL InnoDB page size would help, but we tried this road nevertheless — it didn’t went well due to limitations:

# mkfs.xfs -f -b size=512 -s size=512 /dev/nvme0n1 
Minimum block size for CRC enabled filesystems is 1024 bytes.
# mkfs.xfs -f -m crc=0 -b size=512 /dev/nvme0n1
block size 512 cannot be smaller than logical sector size 4096

Test case #2

The second test scenario used the sysbench tool. As you may know, it's one of the most popular tools in the SysAdmin / DBA communities. We ran the fileio test with a random read/write scenario.

sysbench --file-test-mode=rndrw fileio prepare   # lay out the test files first
sysbench --file-test-mode=rndrw \
fileio run
sysbench throughput results (default fsync)
sysbench file operations results (default fsync)
sysbench file operations results, table (default fsync)

The results with default sysbench options only confirm our suspicions: AWS NVMe looks way faster than anything on the GCP side. It's also worth noting that standard GCP SSD looks faster than GCP's own NVMes (!). This is disturbing. It seems the default fsync() handling is the key here.

To get the best out of the NVMes, we tried minimizing the fsync() rate as much as possible with the flag "--file-fsync-freq=0". Also, since our MySQL service runs with "innodb_flush_method=O_DIRECT", we set the corresponding flag "--file-extra-flags=direct" to make the test case look more like production:

sysbench --file-test-mode=rndrw \
--file-extra-flags=direct \
--file-fsync-freq=0 \
fileio run
sysbench throughput results (no fsync)
sysbench file operations results (no fsync)
sysbench file operations results, table (no fsync)

At this point it's obvious that fsync() handling plays a significant role in disk performance. The question is why the default handling behaves so differently on GCP than on AWS, given that the MySQL configuration is identical.
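
One way to isolate fsync() behaviour from everything else is a tiny sync-write fio job, where every write is followed by an fsync() and the reported IOPS roughly mirror the device's fsync() latency. This is just a side-check sketch of ours (the file path is a placeholder), not part of the charts above:

# 4KB writes, fsync() after every write
fio --name=fsync-check --filename=/mysql/fsync-check --size=256M \
    --rw=write --blocksize=4k --fsync=1 --time_based --runtime=30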

NVMe experimental feature

Apparently it’s already known issue for GCP engineers and they are working on this for some time now. We’ve found out that by contacting Customer Engineer (CE) Dmitriy Novakovskiy. GCP team has given us a chance to test experimental/alpha NVMe feature which goal is to improve fsync() performance before it’s official release. Both sysbench test and actual MySQL replication lag results do look promising.

sysbench throughput results (no fsync), including GCP experimental NVMe
sysbench file operations results (no fsync), including GCP experimental NVMe
sysbench file operations results, table (no fsync), including GCP experimental NVMe
replication lag with local-ssd
replication lag with experimental local-ssd

Assumptions / Conclusions

The first two test scenarios, fio and sysbench, indicated that GCP NVMes are slower than their AWS counterparts. Based on our findings, the main factors that may affect disk performance are:

  • disk block / sector size: aligning it with the InnoDB page size could presumably improve performance, but it cannot be changed on these devices
  • default fsync() handling by the disk driver: explicitly minimizing the fsync() rate with sysbench shows a significant improvement, but it's unclear why the default rate, which is what MySQL relies on, performs so poorly

The last test case, with the experimental GCP NVMe, showed a significant improvement in IO throughput and file read/write operations compared to both AWS NVMe and the current GCP NVMe disks.

We do hope the GCP team will bring this experimental (alpha) feature to a stable release as soon as possible.
