Guava BloomFilter 实现

Guava BloomFilter 实现

Guava BloomFilter 改进点

  1. 重写了 bitset,可以参见 LockFreeBitArray类的实现 ,突破了int.maxsize 大约21亿的限制
  2. 采用MurmurHash3 非加密型哈希函数,对大块的数据,具有较高的平衡性与低碰撞率。en.wikipedia.org100w字符串hash,1s+
  3. 论文Less Hashing, Same Performance!Building a Better Bloom Filter,提到只需要两个;论文用google翻译了一部分,见
  4. 32位散列函数这个技巧并不会显着降低Bloom过滤器的性能;guava 具体做法见 BloomFilterStrategies 类的MURMUR128_MITZ_32MURMUR128_MITZ_64两种策略的实现
  5. 添加元素过程中,并非执行了 optimalNumOfHashFunctions 次hash 操作,改为一次MurmurHash3,和optimalNumOfHashFunctions次基础运算,大大减少了计算次数,提高了性能
  6. BloomFilter 一直被标记为@Beta,但已经在google很多产品中得到运用

BloomFilter 类结构图

google guava 中bloomfiter 的public 方法如下图:

BloomFilter 源码分析

实际使用,调用create 方法创建bloomfilter;再put/putAll进bloomfilter;再通过mightContain方法来判断元素是否存在于bloomfilter 中:

调用

1
2
Charset charset = Charset.forName("utf-8");
BloomFilter<String> bloomFilter = BloomFilter.create(Funnels.stringFunnel(charset),100000000L,0.001F);

创建BloomFilter:

1
2
3
4
5
//# create 方法
public static <T> BloomFilter<T> create(
Funnel<? super T> funnel, long expectedInsertions, double fpp) {
return create(funnel, expectedInsertions, fpp, BloomFilterStrategies.MURMUR128_MITZ_64);
}

根据预期插入元素数、错误率 求得 最优bitset 个数和最优hash 函数个数;再创建BloomFilter;其中关注下 LockFreeBitArray 的实现,java.util.BitSet 只接收int 参数,也就是最大只能设置Integer.MAX_SIZE大约21亿的Set;

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
@VisibleForTesting
static <T> BloomFilter<T> create(
Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {
/** 省略参数检查*/
/*
* TODO(user): Put a warning in the javadoc about tiny fpp values, since the resulting size
* is proportional to -log(p), but there is not much of a point after all, e.g.
* optimalM(1000, 0.0000000000000001) = 76680 which is less than 10kb. Who cares!
*/
//最优 number of bit set
long numBits = optimalNumOfBits(expectedInsertions, fpp);
//最优 hash函数个数
int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
try {
// 这里用 用LockFreeBitArray,因为java.util.bitset 最大个数只能使用int最大值来体现,显然,int 已经不满足google了
return new BloomFilter<T>(new LockFreeBitArray(numBits), numHashFunctions, funnel, strategy);
} catch (IllegalArgumentException e) {
throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e);
}
}
//# 创建 bloomFilter
private BloomFilter(
LockFreeBitArray bits, int numHashFunctions, Funnel<? super T> funnel, Strategy strategy) {
checkArgument(numHashFunctions > 0, "numHashFunctions (%s) must be > 0", numHashFunctions);
checkArgument(
numHashFunctions <= 255, "numHashFunctions (%s) must be <= 255", numHashFunctions);
this.bits = checkNotNull(bits);
this.numHashFunctions = numHashFunctions;
this.funnel = checkNotNull(funnel);
this.strategy = checkNotNull(strategy);
}

put 元素,变更set 中bit 的值;

MURMUR128_MITZ_64为例:通过将元素进行MurmurHash3 求得128位共16字节的byte数组,将byte数组拆分位两个8字节的long型的hash值;然后基于这两个hash值,计算得到数组下标,循环前面算出来的最优hash函数次数,依据index设置LockFreeBitArray[index]=1。源码可以看出,求出来的最优hash 函数次数,并不是进行了多少次的hash操作,实际上只进行了一次hash操作,循环中只是做了一些位移、反转、加乘操作;性能大大提升

MURMUR128_MITZ_64MURMUR128_MITZ_32 策略的实现请参考 Less Hashing, Same Performance!Building a Better Bloom FilterMurmurHash3

MURMUR128_MITZ_64MURMUR128_MITZ_32更简单,因为避免了循环中的乘法,并且执行更简单 ,仅仅是 + = hash2。 还通过与Long.MAX_VALUE进行AND运算而将索引更改为正数,而不是翻转位。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
@CanIgnoreReturnValue
public boolean put(T object) {
return strategy.put(object, funnel, numHashFunctions, bits);
}
//MURMUR128_MITZ_32 策略下的put 方法:
MURMUR128_MITZ_32() {
@Override
public <T> boolean put(
T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
long bitSize = bits.bitSize();
long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
int hash1 = (int) hash64;
int hash2 = (int) (hash64 >>> 32);
boolean bitsChanged = false;
for (int i = 1; i <= numHashFunctions; i++) {
int combinedHash = hash1 + (i * hash2);
// Flip all the bits if it's negative (guaranteed positive number)
if (combinedHash < 0) {
combinedHash = ~combinedHash;
}
bitsChanged |= bits.set(combinedHash % bitSize);
}
return bitsChanged;
}
//MURMUR128_MITZ_64 策略下的put 方法:
MURMUR128_MITZ_64() {
@Override
public <T> boolean put(
T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
long bitSize = bits.bitSize();
byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
long hash1 = lowerEight(bytes);
long hash2 = upperEight(bytes);
boolean bitsChanged = false;
long combinedHash = hash1;
for (int i = 0; i < numHashFunctions; i++) {
// Make the combined hash positive and indexable
bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
combinedHash += hash2;
}
return bitsChanged;
}
private /* static */ long lowerEight(byte[] bytes) {
return Longs.fromBytes(
bytes[7], bytes[6], bytes[5], bytes[4], bytes[3], bytes[2], bytes[1], bytes[0]);
}
private /* static */ long upperEight(byte[] bytes) {
return Longs.fromBytes(
bytes[15], bytes[14], bytes[13], bytes[12], bytes[11], bytes[10], bytes[9], bytes[8]);
}

检查是否包含:mightContain函数;该函数实现和put差不多,只是put 是做set操作,而mightContain是做检测操作,检测 该元素hash后,在numHashFunctions个数组下标的值是否均等于1;如果都等于1 ,则判断 可能包含(还是存在false positive rate).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
@Override
public <T> boolean mightContain(
T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
long bitSize = bits.bitSize();
byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
long hash1 = lowerEight(bytes);
long hash2 = upperEight(bytes);
long combinedHash = hash1;
for (int i = 0; i < numHashFunctions; i++) {
// Make the combined hash positive and indexable
if (!bits.get((combinedHash & Long.MAX_VALUE) % bitSize)) {
return false;
}
combinedHash += hash2;
}
return true;
}

参考

  1. https://github.com/google/guava
  2. https://en.wikipedia.org/wiki/MurmurHash
  3. https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
  4. https://www.elastic.co/blog/all-about-elasticsearch-filter-bitsets
坚持原创技术分享,您的支持将鼓励我继续创作!
分享