Notes from Google "Flash" talk at UCSD non-volatile memories workshop

I was at the UCSD workshop on Non-Volatile Memories about three weeks ago, and had a surprisingly great time. I say "surprisingly" because I showed up at the reception the first night, realized I didn't know a single person there, and thought "uh-oh." That "uh-oh" turned into "ooh!" the next day -- I learned a surprising amount about the lower levels of contemporary nonvolatile memory technology and met some very cool folks.
Many of the slides from the talks are online (though, as in all things, the hallway conversations were unrecorded and perhaps at least as useful). But one of the stand-out talks isn't -- Al Borchers's talk about Google's experiences with Flash memory. I've jotted down some highlights from the talk that jumped out at me. Caveat: These are filtered through my own interests. A lot of what really stuck in my head echoes our own experiences with FAWN, and several points reinforced things I said in my talk the day before, so if something seems odd, it's probably my fault, and not Al's.
Al Borchers is in the platforms group developing system software for Google's server platforms, and has been working on high-performance storage devices. He has a Ph.D. in theoretical CS from Minnesota (1996) and has been hacking Unix and Linux device drivers and systems software in industry. Much of the talk he gave involved work with Naveen Upta, Tom Keane, and Kyle Nesbit.
Looking at HW devices and how SW could be modified to take advantage of Flash, if necessary: "It has been a rocky experience with flash. We've had difficulties with performance and reliability of devices, and figuring out where we can apply flash in a cost-effective way. Many applications... some obvious, some not. Without forcing apps to change too radically."
App trends:
  • High-throughput, large-data workloads are seek limited
  • Apps are latency sensitive - 99.9th %ile latency, disk queueing, might see 100ms latency waiting to read off of disk.
  • [Started out by mentioning the high IOPS-DRAM, low IOPS-disk duality]
  • Disk capacity grows faster than seeks, which are basically constant
  • Waste capacity to provide seeks, or share and prioritize. We just buy more and more disks.
  • Flash between RAM and disk in price and performance
  • Flash provides seeks
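To make the "waste capacity to provide seeks" point concrete, here is a rough sizing sketch. The workload and per-disk numbers are illustrative assumptions of mine, not figures from the talk; the point is only that the seek requirement, not the capacity requirement, can end up deciding how many disks you buy.

```python
import math

def disks_needed(workload_iops, workload_tb, iops_per_disk, tb_per_disk):
    """Return (disks needed for seeks, disks needed for capacity, disks to buy)."""
    for_seeks = math.ceil(workload_iops / iops_per_disk)
    for_capacity = math.ceil(workload_tb / tb_per_disk)
    return for_seeks, for_capacity, max(for_seeks, for_capacity)

# Hypothetical workload: 20,000 random reads/sec over 50 TB of data.
# Hypothetical disk: 150 random IOPS and 2 TB of capacity.
seeks, capacity, total = disks_needed(20_000, 50, 150, 2)
used_fraction = 50 / (total * 2)  # share of purchased capacity actually holding data

print(f"disks for seeks: {seeks}, for capacity: {capacity}, bought: {total}")
print(f"capacity actually used: {used_fraction:.0%}")
```

With these made-up numbers the seek requirement dominates by a wide margin, and most of the purchased capacity sits idle. That is the gap a device that "provides seeks" is meant to fill.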
Application: Bigtable [Work from Kyle Nesbit]
  • Key/value store API, large memory footprint for cache, GFS for persistent storage
  • GFS: File API, data stored on disk, replication for fault tolerance
Options: Could use flash as a cache, and put it either on the GFS chunkservers or on the Bigtable servers. They're looking at it mostly in the chunkservers.
  • Instrumented Bigtable tablet server
  • Collected traces of Bigtable accesses
  • Ran trace simulations offline about cache size and replacement policy
  • For several Google apps
1) Bigger caches are almost always better. Growing the cache from 16 GB to 512 GB cut cache misses from 450 per second to 130 per second: a roughly linear reduction in misses for an exponential increase in cache size.
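As a back-of-the-envelope check on "linear reduction with exponential increase", the two endpoints above can be turned into a misses-saved-per-doubling figure. This assumes the trend is exactly log-linear between 16 GB and 512 GB, which is my simplification; the talk only gave the two endpoints, so the intermediate estimate is an interpolation, not a measurement.

```python
import math

def misses_saved_per_doubling(size_a_gb, misses_a, size_b_gb, misses_b):
    """Slope of the miss rate against log2(cache size)."""
    doublings = math.log2(size_b_gb / size_a_gb)
    return (misses_a - misses_b) / doublings

def interpolated_misses(size_gb, base_size_gb, base_misses, slope):
    """Miss rate at size_gb under the assumed log-linear trend."""
    return base_misses - slope * math.log2(size_gb / base_size_gb)

slope = misses_saved_per_doubling(16, 450, 512, 130)
print(f"misses/sec saved per doubling of cache size: {slope:.0f}")
print(f"interpolated misses/sec at 64 GB: {interpolated_misses(64, 16, 450, slope):.0f}")
```

Five doublings buy a drop of 320 misses/sec, so each doubling is worth about 64 misses/sec under this assumption.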
  • Flash as cache can be very write-intensive
  • Write performance is hugely variable and almost chaotic based on workload.
  • Sequential writes tend to work well, but random writes...
  • Device also wears out, and wear depends on workload.
  • Lifetime affects cost of the solution - how long will it last?
  • LRU: Good hit rate, but bad lifetime/perf with lots of random writes
  • FIFO: Bad hit rate, good lifetime, perf
  • Research opportunity: Replacement policy with good hit rates and good lifetime.
  • Difference between the two evens out as the cache size grows, however... 512GB FIFO nearly as good as LRU.
  • LRU vs FIFO estimated lifetime:
    • FIFO gets 2x the lifetime of LRU based on their simulations. They monitored how many GCs the device was doing.
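The LRU/FIFO trade-off can be sketched with a toy trace-driven simulator. This is my own minimal illustration, not the simulator used in the talk. It assumes each cached object occupies one fixed slot on the flash device, and it counts writes that do not land in the slot right after the previous write as a crude proxy for random-write pressure. It does not model garbage collection or wear.

```python
from collections import OrderedDict

def simulate(trace, capacity, policy):
    """Replay a key trace; return (hit rate, number of non-sequential slot writes)."""
    cache = OrderedDict()  # key -> slot index, ordered from next victim to newest
    hits = 0
    nonseq_writes = 0
    last_slot = -1
    for key in trace:
        if key in cache:
            hits += 1
            if policy == "lru":
                cache.move_to_end(key)  # refresh recency; FIFO leaves the order alone
            continue
        if len(cache) < capacity:
            slot = len(cache)  # still filling the device front to back
        else:
            _, slot = cache.popitem(last=False)  # reuse the victim's slot
        cache[key] = slot
        if slot != (last_slot + 1) % capacity:
            nonseq_writes += 1
        last_slot = slot
    return hits / len(trace), nonseq_writes

trace = [1, 2, 1, 3, 1, 4, 1, 5]  # one hot key mixed with one-off keys
for policy in ("lru", "fifo"):
    print(policy, simulate(trace, capacity=2, policy=policy))
```

On this tiny trace LRU keeps the hot key and gets the better hit rate, but every eviction overwrites an out-of-order slot. FIFO misses more but always writes the next slot in a circular log. Once the cache is big enough to hold everything, the two behave identically, which matches the observation that the gap narrows as the cache grows.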
2) CPU overhead
  • Flash shifts from seek-bound to CPU-bound
  • High IOPS but proportional CPU overhead
  • Storage stack large fraction of application CPU usage
  • CPU overhead depends on h/w interface, driver, block layer, I/O scheduler, file system, NUMA
  • CPU overhead depends on application programming model: direct I/O, async/sync I/O, single/multi-threaded I/O (most experiments used direct I/O, which performs better. Async is better than sync, single thread better than multi... but apps may feel otherwise!)
  • As for CPU overhead: with the highest-perf devices, we use one core to do I/O, and then we get NUMA effects as things communicate over the system bus.
Al then gave some numbers on the relative performance of, e.g., PCI vs SAS vs SATA drives they'd measured. I didn't write all of these down well enough to be confident reposting them. The gist was that access over PCI incurred less CPU overhead than SAS and SATA. NUMA access - when you had to go through a different core - hurt just as much.
CPU use for async/multithread: At high BW, sync multithreaded model uses 2-3x the CPU. They didn't really see that in SATA because it was limited to 31 outstanding requests by NCQ.
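One way to see why a cap on outstanding requests matters is Little's law: sustained throughput equals the number of requests in flight divided by per-request latency. The sketch below applies that to the 31-request limit mentioned above. The latency values are assumptions I picked for illustration, not measurements from the talk, and reading the SATA result this way is my interpretation.

```python
def max_iops(queue_depth, latency_us):
    """Upper bound on IOPS from Little's law: in-flight requests / latency."""
    return queue_depth * 1_000_000 / latency_us

# Assumed per-request latencies, in microseconds (illustrative only).
for label, latency_us in [("100 us", 100), ("1 ms", 1_000), ("10 ms", 10_000)]:
    print(f"31 outstanding requests at {label}: at most {max_iops(31, latency_us):,.0f} IOPS")
```

If the device cannot keep more than 31 requests in flight, the request rate tops out no matter how many threads the application adds. That could explain why the extra CPU cost of the sync multithreaded model never showed up on SATA.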
Overhead summary:
  • Block layer: Even raw block device with direct IO can add 2x to 3x CPU overhead of accessing RAM
  • IO scheduler adds 30% overhead, unnecessary disk optimizations and lock contention (one device could cut that out)
  • FS added 39% overhead; metadata writes and cold data hurts perf and lifetime on some filesystems
  • if you just have a few files, you can do a very simple userspace FS...
  • NUMA 20% to 40% more overhead
  • IO size constrained by flash page/block sizes
  • Doesn't mean flash has to be a block device in the storage stack
  • New level in storage hierarchy needs a new interface
  • Next gen high speed NAND devices will have higher IOPS
  • Research opportunities:
    • Optimize storage stack for flash performance characteristics
    • Other interfaces for flash -- for example, user space IO
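As a sketch of the "very simple userspace FS" idea, here is a fixed table of named extents laid out back to back in one preallocated file, with all I/O done in whole pages. Everything here is hypothetical: the `ExtentStore` name, the 4096-byte page size, and the layout are my assumptions, and a regular file stands in for the flash device. It does not use direct I/O and says nothing about how Google does this.

```python
import os
import tempfile

PAGE = 4096  # assumed flash page size; real devices vary

class ExtentStore:
    """A few named, fixed-size extents in one file; no metadata writes after setup."""

    def __init__(self, path, layout):
        # layout maps name -> size in bytes; each extent is rounded up to whole pages.
        self.extents = {}
        offset = 0
        for name, size in layout.items():
            length = -(-size // PAGE) * PAGE  # ceiling to a page multiple
            self.extents[name] = (offset, length)
            offset += length
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        os.ftruncate(self.fd, offset)  # preallocate the backing file's size

    def _offset(self, name, page_index):
        start, length = self.extents[name]
        if not 0 <= page_index < length // PAGE:
            raise ValueError("page outside the extent")
        return start + page_index * PAGE

    def write(self, name, page_index, data):
        if len(data) != PAGE:
            raise ValueError("writes must be exactly one page")
        os.pwrite(self.fd, data, self._offset(name, page_index))

    def read(self, name, page_index):
        return os.pread(self.fd, PAGE, self._offset(name, page_index))

    def close(self):
        os.close(self.fd)

with tempfile.TemporaryDirectory() as d:
    store = ExtentStore(os.path.join(d, "flash.img"), {"index": 5000, "data": 4096})
    store.write("data", 0, b"x" * PAGE)
    print(store.extents, store.read("data", 0)[:4])
    store.close()
```

Because the layout is computed once and never changes, there is no per-write metadata traffic, which is the property the file system bullet above is after.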
3) Error rates
  • Read error rate higher than predicted by bit error rate
  • Block, plane, and die failures seem to dominate
  • Errors seem concentrated in a few bad chips
  • (haven't watched the devices to see errors through wearout and aging)
  • Small impact to caching apps -- cache miss
  • Large impact to DB apps -- data loss
  • Looking at RAID, but w/traditional RAID drivers, perf is terrible. Adds another layer of CPU overhead. Looking at optimized RAIDs for flash, would like to see more...
  • Research opportunity: fault tolerance in flash storage devices; long-term, large-scale failure rates and mechanisms.
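For readers who have not looked at RAID internals, the core of single-parity protection is a bytewise XOR across blocks. The sketch below is the generic textbook technique, not anything specific to what Google evaluated. It shows how one lost block (say, from a failed die) can be rebuilt from the survivors plus parity. Doing this for every write is also where the extra CPU overhead mentioned above comes from.

```python
def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    out = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block from the survivors and the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])

data = [b"abcd", b"efgh", b"ijkl"]  # one block per (hypothetical) flash chip
parity = xor_blocks(data)

# Pretend the chip holding data[1] failed.
recovered = rebuild_missing([data[0], data[2]], parity)
print(recovered == data[1])
```

Single parity only survives one failure per stripe, so it fits the "errors concentrated in a few bad chips" pattern only when stripes are laid out so that no two blocks of a stripe share a chip.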
Q: Which drives did you use?
A: Doesn't matter. Can't say. But all of the devices suffer perf overhead.
Q: Can you comment on SLC vs MLC reliability?
A: Initially, the reliability of SLC seemed a little bit better, but we haven't taken them to end of life and worn them out. For both, we saw a lot of early-life failures.
  • We feel like we're forced to MLC...
  • Encouraged that manufacturers are talking about enterprise MLC... discouraged looking at commodity flash chips that have shorter and shorter lifetimes
Q: Comment on PCI Express as an interface?
A: We like it better, it seems to perform better, lower overhead, ...
... more about the high overhead of going through the block layer to get to SSDs at high IOPS.
All in all, it was an excellent talk, and shows that Google has been taking a very serious look at Flash in their datacenters. We're seeing a lot of indicators that Flash is poised -- but not completely ready yet -- to start making huge inroads into the DC.

Old comments

  • Benoit Hudson wrote: At a high level this sounds remarkably similar to talks I went to in the mid-90s that used disk as a cache for tape. The specifics are different (tape was the error-prone medium then).
    • Dave Andersen replied: Everything old is new again. :-) We have a SIGCOMM paper that was just accepted on using circuit switched optics in the datacenter. Back to the future...
      I hadn't seen your accident report (rock in eye). Yowza - glad everything ended okay!

