Talking nerdy continued – Bloom Filter
by Security Dude
A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive retrieval results are possible, but false negatives are not; i.e. a query returns either “inside set (may be wrong)” or “definitely not in set”. Elements can be added to the set, but not removed (though this can be addressed with a counting filter). The more elements that are added to the set, the larger the probability of false positives.
Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries.
Learning how it all goes together
So I found some example on the interwebs. Most had implementations with MD5 algorithms which
I believe is incorrect (no need for crpyto-hash). (There are two approaches that I found generate results in line with the theoretical predictions. One is to use a cryptographic hash function like SHA-1. Cryptography’s requirements may be overkill for a bloom filter, but crypto hashes do spread their output uniformly across the hash space. So SHA-1 works fine, and you can feed the hash a constant or part of the earlier-generated hash to generate extra data as needed so you’re not limited to 5 hashes.) Python sklearn has an murmurhash implementation. So what do we do? Hash off.
import time import random import string from hashlib import md5 from sklearn.utils import murmurhash3_32 digits = string.lowercase + string.uppercase + string.digits def test_murmurhash_sklearn(iterable): [ murmurhash3_32(item) for item in iterable ] def test_md5_hexdigest(iterable): # mask global variables [ md5(item).hexdigest() for item in iterable ] def timeit(func, *args, **kwargs): #higher order function now = time.clock() func(*args, **kwargs) return time.clock() - now # start generator gen_str = (''.join(random.sample(digits, 10)) for _ in range(1000000)) print timeit(test_md5_hexdigest, gen_str) # start generator gen_str = (''.join(random.sample(digits, 10)) for _ in range(1000000)) print timeit(test_murmurhash_sklearn, gen_str)
Learning about ByteArray builtin function in python
ByteArrays can be used for many things related to data manipulation, and other things like saving game data online, encrypting data, compressing data, and converting a BitmapData object to a PNG or JPG file. I’m sure that the different doc types all have a structured byte array beginning value. To see more types I import the struct module.
This module performs conversions between Python values and C structs represented as Python strings. This can be used in handling binary data stored in files or from network connections, among other sources. It uses Format Strings as compact descriptions of the layout of the C structs and the intended conversion to/from Python values.