Skip to content

Reservoir sampling

There are 2 official labels for the sampling of reservoirs in Leikou. According to my question, there may be three or four. The proportion is relatively low, so you can choose to master it according to your actual situation.

The algorithmic thinking of reservoir sampling is very clever, the code is simple and easy to understand, even if you don't master it, it is also very good for understanding.

Problem Description

Given a data stream, we need to randomly select k numbers in this data stream. Because the length of this data stream is very large, it needs to be traversed and processed at the same time, and it cannot be loaded into the memory all at once.

Please write a random selection algorithm so that all the data in the data stream are selected with equal probability.

There are many expressions of this problem. For example, let you randomly extract k points from a rectangle, randomly extract k words from a word list, etc., and ask you to wait for the probability of random extraction. No matter how the description changes, its essence remains the same. Today we will take a look at how to do this kind of problem.

Algorithm Description

This algorithm is called reservoid sampling.

The basic idea is:

-Construct an array of size k, and put the first k elements of the data stream into the array. -No processing is done first for the first k numbers of the data stream. -Starting from the k + 1 number in the data stream, choose a number rand between [1, i], where i represents the current number. -Do nothing if rand is greater than or equal to k -If rand is less than k, swap rand and i, that is, select the current number to replace the selected number (spare tire). -Finally return to the surviving spare tire

The core of this algorithm is to select a number with a certain probability first, and then replace the selected number with another probability in the subsequent process. Therefore, in fact, the probability of each number being selected is the probability of being selected*the probability of not being replaced.

Fake code:

An algorithm book referred to by the pseudo-code, with slight modifications.

Init: a reservoir with the size: k
for i = k+1 to N
    if(random(1, i) <k) {
        SWAP the Mth value and ith value
    }

Can this ensure that the selected number is equal probability? The answer is yes.

-When i <= k, the probability of i being selected is 1. -When the k + 1 number is reached, the probability of the k + 1 number being selected (the probability of entering the if branch above) is \(\frac{k}{k+1}\), to the k + 2 number When counting, the probability that the k + 2 number is selected (the probability of entering the if branch above) is \(\frac{k}{k+2}\), and so on. Then the probability that the nth number is selected is \(\frac{k}{n}\) -The probability of being selected is analyzed above, and then the probability of not being replaced is analyzed. When the number k + 1 is reached, the probability that the first k numbers will be replaced is \(\frac{1}{k}\). When the first k + 2 numbers are reached, the probability of the k + 2 number being replaced is \(\frac{1}{k}\), and so on. That is to say, the probability of all being replaced is \(\frac{1}{k}\). Knowing the probability of being replaced, then the probability of not being replaced is actually 1-the probability of being replaced.

Therefore, for the first k numbers, the final probability of being selected is 1 * the probability of not being replaced by k + 1* the probability of not being replaced by k + 2* ... the probability of not being replaced by n, that is, 1 * (1-the probability of being replaced by k + 1) * (1-the probability of being replaced by k + 2) * ... (1-the probability of being replaced by n), ie $1 \times (1-\frac{ k}{k+1} \times \frac{1}{k}) \times (1-\frac{k}{k+2} \times \frac{1}{k}) \times ... times (1-\frac{k}{n} \times \frac{1}{k}) = \frac{k}{n} $.

For the i-th (i> k) number, the final probability of being selected is the probability of being selected at the i-th step* the probability of not being replaced by the i + 1 step* ... * not being replaced by the n-th step The probability of $\frac{k}{k+1} \times (1-\frac{k}{k+2} \times \frac{1}{k}) \times ... \times (1 -\frac{k}{n} \times \frac{1}{k}) = \frac{k}{n} $.

In short, no matter which number it is, the probability of being selected is \(\frac{k}{n}\), which satisfies the requirement of equal probability.

-382. Linked List Random Node -398. Random Number Index -497. Random points in non-overlapping rectangles

to sum up

The core code of the reservoir sampling algorithm is very simple. But it's not easy to think of, especially if you have not seen it before. The core point is that the probability that each number is finally selected is the probability of being selected*the probability of not being replaced. So we can adopt a certain dynamic method to make every round have probability to select and replace some numbers. We have given the proof process of equal probability above, so you might as well try to prove it yourself. After practicing with the related topics at the end of the article, the effect will be better.