Skip to content

Bloom filter

Scenes

Suppose you want to deal with such a problem now, you have a website and have many visitors. Whenever a user visits, you want to know if this IP is the first time to visit your website.

hashtable can it

The obvious answer is to store all IPs in a hashtable, fetch them from the hashtable for each visit, and then judge. But the title says that the website has many visitors. If 1 billion users have visited, assuming the IP is IPV4, then the length of each IP is 4 bytes, then you need 4 * 1000000000 = 4000000000Bytes = 4G in total.

If it is to judge the URL blacklist, since each URL will be longer (may be much larger than the 4 bytes of the above IPV4 address), the space required may be far greater than your expectations.

bit

Another solution that is a little harder to think of is bit. We know that bit has two states of 0 and 1, so it is more appropriate to indicate existence and nonexistence.

If there are 1 billion IPs, you can use 1 billion bits to store, then you need a total of 1 * 1000000000 = (4000000000 / 8) Bytes = 128M, which becomes 1/32 of the original. If it is to store URL, this kind of change Longer strings are more efficient. The question is, how do we associate IPV4 with the position of the bit?

For example, which digit should be used for 192.168.1.1, and which digit should be used for 10.18.1.1? The answer is to use a hash function.

Based on this idea, we only need two operations, set(ip) and has(ip), and a built-in function hash(ip) to map IP to bit table.

This has two very fatal disadvantages:

  1. When the sample distribution is extremely uneven, it will cause a lot of space waste

We can solve it by optimizing the hash function

  1. When the element is not an integer (such as URL), BitSet is not applicable

We can still use the hash function to solve it, and even hash it several times

Bloom filter

Bloom filter is actually bit + multiple hash functions. K times hash(ip) will generate multiple indexes, and set the binary of its k index positions to 1.

-If the values ​​of k index positions are all 1, then it is considered that there may be ** (because of the possibility of conflict). -If one is not 1, then **must not exist** (a value obtained by the hash function must be unique), which is also an important feature of Bloom filters.

In other words, the Bloom filter answered: may exist and must not exist.

bloom-filter-url

As can be seen from the above figure, the Bloom filter is essentially composed of a very long binary vector and multiple hash functions.

Since there is no 100% reliability of hashtable, this is essentially a reliability in exchange for space. In addition to reliability, the Bloom filter is also troublesome to delete.

False positive

The Bloom filter mentioned above answers the question: may exist and must not exist. So what should you do when the answer is may exist? Generally speaking, in order to kill a thousand by mistake rather than let one go, we think he exists. At this time, a false positive occurred.

The false alarm rate is inversely proportional to the length of the binary vector.

Application of Bloom Filter

  1. Web crawlers

Determine whether a URL has been crawled

  1. KV database determines whether a key exists

For example, each Region of Hbase contains a BloomFilter, which is used to quickly determine whether a key exists in the region during query.

  1. Phishing website identification

The browser sometimes warns the user that the website visited is probably a phishing website, and this technology is used

From this algorithm, everyone can have a better understanding of tradeoff.

  1. Malicious website identification

In short, if you need to judge whether an item has appeared in a collection, and you need to be 100% sure that it has not appeared, or may have appeared, you can consider using a Bloom filter.

Code

public class MyBloomFilter {
     private static final int DEFAULT_SIZE = 2 << 31;
     private static final int[] seeds = new int [] {3,5,7,11,13,19,23,37 };
     private BitSet bits = new BitSet(DEFAULT_SIZE);
     private SimpleHash[] func = new SimpleHash[seeds.length];

     public static void main(String[] args) {
        //use
        String value = "www.xxxxx.com";
        MyBloomFilter filter = new MyBloomFilter();
        System.out.println(filter.contains(value));
        filter.add(value);
        System.out.println(filter.contains(value));
    }
    //Constructor
     public MyBloomFilter() {
         for (int i = 0; i <seeds.length; i ++) {
            func[i] = new SimpleHash(DEFAULT_SIZE, seeds[i]);
        }
    }
     //Add website
     public void add(String value) {
         for (SimpleHash f: func) {
            bits.set(f.hash(value), true );
        }
    }
     //Determine whether the suspicious website exists
     public boolean contains(String value) {
         if (value == null) {
             return false;
        }
         boolean ret = true;
         for (SimpleHash f: func) {
            //The core is through the "and" operation
            ret = ret && bits.get(f.hash(value));
        }
         return ret;
    }
}

to sum up

Bloom filter answers: may exist and must not exist. The essence is a trade-off between space and accuracy. There may be false positives in actual use. If your business can accept false positives, then using bloom filters for optimization is a good choice.