YouTube Strategy: Adding Jitter isn’t a Bug

The "adding jitter" strategy was one of the most commented-on techniques on HackerNews from 7 Years Of YouTube Scalability Lessons In 30 Minutes, probably because it’s one of those emergent phenomena that you really can’t predict and that is shocking when you see it in real life. Here’s the technique:

Add Entropy Back into Your System

  • If your system doesn’t jitter then you get thundering herds. Distributed applications are really weather systems. Debugging them is about as deterministic as predicting the weather. Jitter introduces more randomness because, surprisingly, things tend to stack up.
  • For example, cache expirations. For a popular video they cache things as best they can. They might cache the most popular video for 24 hours. If everything expires at one time then every machine will calculate the expiration at the same time. This creates a thundering herd.
  • By jittering you are saying: randomly expire between 18 and 30 hours. That prevents things from stacking up. They use this all over the place. Systems have a tendency to self-synchronize as operations line up and try to destroy themselves. Fascinating to watch. You get a slow disk system on one machine and everybody is waiting on a request, so all of a sudden all these other requests on all these other machines are completely synchronized. This happens when you have many machines and you have many events. Each one actually removes entropy from the system so you have to add some back in.
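
To make the cache example concrete, here is a minimal Python sketch of a jittered expiration. It is not YouTube's code. The function name jittered_ttl_seconds, the constants, and the choice of a uniform distribution are assumptions made for illustration; the only numbers taken from the talk are the 24 hour nominal lifetime and the 18-30 hour window.

```python
import random

# Assumed values for illustration: a 24 hour nominal lifetime with
# +/- 25% jitter yields the 18-30 hour window described above.
BASE_TTL_HOURS = 24.0
JITTER_FRACTION = 0.25


def jittered_ttl_seconds(base_hours=BASE_TTL_HOURS,
                         jitter=JITTER_FRACTION,
                         rng=random):
    """Return a cache lifetime in seconds, drawn uniformly from
    [base * (1 - jitter), base * (1 + jitter)] hours.

    `rng` can be the `random` module or a `random.Random` instance,
    which makes the function easy to seed in tests.
    """
    low = base_hours * (1.0 - jitter)
    high = base_hours * (1.0 + jitter)
    return rng.uniform(low, high) * 3600.0


if __name__ == "__main__":
    # Each machine (or each cache entry) picks its own lifetime, so
    # expirations spread across a 12 hour window instead of all
    # landing on the same instant.
    for machine in range(5):
        hours = jittered_ttl_seconds() / 3600.0
        print(f"machine {machine}: expires in {hours:.2f} hours")
```

The point of the design is that the randomness is drawn at the moment each entry is written, so no coordination between machines is needed to keep their expirations from lining up.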

Comments from HackerNews really help to fill out the topic with more detail: