Some helpful patterns fell out of our experience, even though they weren't goals originally:
* Use aggressive timeouts to cut off the long tail.
You can't ever shake out all the unfairness in the system, so some requests will take an unreasonably long time to finish — way over the 99.9th percentile. If there are multiple stateless app servers, you can just cut a client loose once it has waited past a "reasonable" amount of time, and let it try its luck with a different app server.
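A minimal sketch of that retry-elsewhere pattern, in Python. The names here (`run_query`, `app_servers`, the `Timeout` exception) are illustrative assumptions, not FlockDB's actual API; the point is just that a timed-out attempt is abandoned and retried against a different stateless server.

```python
import random

class Timeout(Exception):
    """Raised (hypothetically) when a single attempt exceeds its deadline."""
    pass

def query_with_timeout(run_query, app_servers, timeout_s=0.1, attempts=3):
    """Try up to `attempts` distinct app servers, cutting off any attempt
    that runs past `timeout_s` and letting the client try its luck elsewhere."""
    for server in random.sample(app_servers, min(attempts, len(app_servers))):
        try:
            # run_query is assumed to enforce the timeout and raise Timeout.
            return run_query(server, timeout_s)
        except Timeout:
            continue  # this server hit the long tail; move on
    raise Timeout("all %d attempts exceeded %.3fs" % (attempts, timeout_s))
```

Because the servers are stateless, retrying on a different one is safe, and the aggressive per-attempt timeout keeps one unlucky server from dragging the whole request past the tail.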
* Make every case an error case.
Or, to put it another way, use the same code path for errors as you use in normal operation. Don't create rarely-tested modules that only kick in during emergencies, when you're least likely to feel like trying new things. We queue all write operations locally (using Kestrel as a library), and any that fail are thrown into a separate error queue. This error queue is periodically flushed back into the write queue, so that retries use the same code path as the initial attempt.
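The write-queue/error-queue loop described above can be sketched as follows. This is a toy Python model, not Kestrel (which is a Scala library); the class and method names are assumptions. What it demonstrates is the one invariant from the text: a retry re-enters the ordinary write queue, so it runs through exactly the same code path as a fresh write.

```python
import collections

class WritePipeline:
    def __init__(self, apply_write):
        self.apply_write = apply_write            # performs the write; may raise
        self.write_queue = collections.deque()    # queued locally, like Kestrel
        self.error_queue = collections.deque()    # failed writes land here

    def enqueue(self, op):
        self.write_queue.append(op)

    def drain(self):
        # One code path handles both fresh writes and retries.
        while self.write_queue:
            op = self.write_queue.popleft()
            try:
                self.apply_write(op)
            except Exception:
                self.error_queue.append(op)

    def flush_errors(self):
        # Periodically re-inject failures into the normal write queue,
        # so retries exercise the same path as the initial attempt.
        while self.error_queue:
            self.write_queue.append(self.error_queue.popleft())
```

Since there is no separate "retry module," the retry path is tested on every single write, which is the whole point of making every case an error case.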
* Do nothing automatically at first.
Provide lots of gauges and levers, and automate with scripts once patterns emerge. FlockDB measures the latency distribution of each query type across each service (MySQL, Kestrel, Thrift) so we can tune timeouts, and reports counts of each operation so we can see when a client library suddenly doubles its query load (or we need to add more hardware). Write operations that cycle through the error queue too many times are dumped into a log for manual inspection. If it turns out to be a bug, we can fix it, and re-inject the job. If it's a client error, we have a good bug report.
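The "too many cycles through the error queue" cutoff might look like the sketch below. The threshold, queue shapes, and `dead_log` are all assumptions (the post doesn't give the actual limit); the mechanism shown is just counting cycles and diverting repeat offenders to a log for manual inspection, from which a fixed job can be re-injected by hand.

```python
import collections

MAX_CYCLES = 3  # assumed threshold; the real cutoff isn't stated in the post

def flush_errors(error_queue, write_queue, retry_counts, dead_log):
    """Move failed ops back into the write queue, except ones that have
    already cycled MAX_CYCLES times, which are dumped for manual review."""
    while error_queue:
        op = error_queue.popleft()
        retry_counts[op] += 1
        if retry_counts[op] > MAX_CYCLES:
            dead_log.append(op)  # inspect by hand; re-inject if it was our bug
        else:
            write_queue.append(op)
```

This is the "do nothing automatically at first" stance in miniature: the system never silently drops or endlessly retries a write; past the threshold, a human looks at it.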