Talking nerdy continued – Bloom Filter

by Security Dude

Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive retrieval results are possible, but false negatives are not; i.e. a query returns either “inside set (may be wrong)” or “definitely not in set”. Elements can be added to the set, but not removed (though this can be addressed with a counting filter). The more elements that are added to the set, the larger the probability of false positives.

Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries.

Learning how it all goes together

So I found some example on the interwebs. Most had implementations with MD5 algorithms which I believe is incorrect (no need for crpyto-hash). (There are two approaches that I found generate results in line with the theoretical predictions. One is to use a cryptographic hash function like SHA-1. Cryptography’s requirements may be overkill for a bloom filter, but crypto hashes do spread their output uniformly across the hash space. So SHA-1 works fine, and you can feed the hash a constant or part of the earlier-generated hash to generate extra data as needed so you’re not limited to 5 hashes.) Python sklearn has an murmurhash implementation. So what do we do? Hash off.

import time
import random
import string
from hashlib import md5
from sklearn.utils import murmurhash3_32

digits = string.lowercase + string.uppercase + string.digits

def test_murmurhash_sklearn(iterable):
    [ murmurhash3_32(item) for item in iterable ]

def test_md5_hexdigest(iterable): # mask global variables
    [ md5(item).hexdigest() for item in iterable ]

def timeit(func, *args, **kwargs): #higher order function
    now = time.clock()
    func(*args, **kwargs)
    return time.clock() - now

# start generator
gen_str = (''.join(random.sample(digits, 10)) for _ in range(1000000))

print timeit(test_md5_hexdigest, gen_str)

# start generator
gen_str = (''.join(random.sample(digits, 10)) for _ in range(1000000))

print timeit(test_murmurhash_sklearn, gen_str)


Learning about ByteArray builtin function in python

ByteArrays can be used for many things related to data manipulation, and other things like saving game data online, encrypting data, compressing data, and converting a BitmapData object to a PNG or JPG file. I’m sure that the different doc types all have a structured byte array beginning value.  To see more types I import the struct module.

This module performs conversions between Python values and C structs represented as Python strings. This can be used in handling binary data stored in files or from network connections, among other sources. It uses Format Strings as compact descriptions of the layout of the C structs and the intended conversion to/from Python values.

http://en.wikipedia.org/wiki/Bloom_filter

http://code.activestate.com/recipes/577684-bloom-filter/

http://billmill.org/bloomfilter-tutorial/

http://axiak.github.com/pybloomfiltermmap/

https://sites.google.com/site/murmurhash/

http://dabeaz.blogspot.com/2010/01/few-useful-bytearray-tricks.html

http://docs.python.org/2/library/functools.html

http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html

http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

http://kmike.ru/python-data-structures/

Advertisements