Making a "hash" of things
I seem to talk a lot about "hashes" and it's easy to assume that people know all about them.
Put simply a hash function takes an input and outputs a numeric value.
The value output can be used in checksums, to speed data storage and retrieval and a million other purposes.
Hash functions are varied any may have special properties.
the SHA-1 hash for example produces a long 160bit hash value that is so unlikely to be the same for different inputs, it is used to index all revisions of all files in "git" repositories on the assumption no two files or their revisions will ever have the same SHA-1 hash code.
SHA-1 is of course a little complex.
A simpler use of a hash is if we are storing a series of values and what to see if a particular value is stored.
If we put things in an unordered array, all we can do is look at each element of the array in turn until we either find the value or have looked at all elements. This would be very slow.
An alternative would be keeping the array in order, but that is still not exactly a fast approach.
The hashing approach is to have an array where we convert the hash of a value into an index into the array generally using a modulus of the hash that is the length of a preallocated array.
hash = hash_function(value)
index = hash % length
If the array element is unpopulated then the value does not exist. If it does exist then it gets interesting as in this scheme no matter how good the hash index may come out the same for different values, this is called a collision. How to deal with this type of storage in practice is a whole topic on of itself.
The key point though is that as collisions are expected we don't need to waste compute time on a very good hash.
The counterpoint is that some hash functions are really not very good and can effect performance.
Lets see for ourselves.
Comments
Post a Comment