1. Hashing involves computing the address of a data item by computing a function on the search key value.
2. A hash function h is a function from the set of all search key values K to the set of all bucket addresses B.
- We choose a number of buckets to correspond to the number of search key values we will have stored in the database.
- To perform a lookup on a search key value Ki, we compute h(Ki), and search the bucket with that address.
- If two search keys i and j map to the same address, because h(Ki) = h(Kj), then the bucket at the address obtained will contain records with both search key values.
- In this case we will have to check the search key value of every record in the bucket to get the ones we want.
- Insertion and deletion are simple.
Hash Functions
1. A good hash function gives an average-case lookup that is a small constant, independent of the number of search keys.
2. We hope records are distributed uniformly among the buckets.
3. The worst hash function maps all keys to the same bucket.
4. The best hash function maps all keys to distinct addresses.
5. Ideally, distribution of keys to addresses is uniform and random.
6. Suppose we have 26 buckets, and map names beginning with ith letter of the alphabet to the ith bucket.
- Problem: this does not give uniform distribution.
- Many more names will be mapped to "A" than to "X".
- Typical hash functions perform some operation on the internal binary machine representations of characters in a key.
- For example, compute the sum, modulo # of buckets, of the binary representations of characters of the search key.
- See Figure 11.18, using this method for 10 buckets (assuming the ith character in the alphabet is represented by integer i).
Handling of bucket over flows
1. Open hashing occurs where records are stored in different buckets. Compute the hash function and search
the corresponding bucket to find a record.
2. Closed hashing occurs where all records are stored in one bucket. Hash function computes addresses within that bucket. (Deletions are diffcult.) Not used much in database applications.
3. Drawback to our approach: Hash function must be chosen at implementation time.
- Number of buckets is fixed, but the database may grow.
- If number is too large, we waste space.
- If number is too small, we get too many "collisions", resulting in records of many search key values being in the same bucket.
- Choosing the number to be twice the number of search key values in the file gives a good space/performance tradeo.
