From: Ian Lynagh
Date: Thu, 30 Aug 2007 14:28:44 +0000 (+0000)
Subject: Better hash functions for Data.HashTable, from Jan-Willem Maessen
X-Git-Tag: 2007-09-13~6
X-Git-Url: http://git.megacz.com/?a=commitdiff_plain;h=ea6bc8e84134b31e1548df5879d2695ddd60b1cb;p=ghc-base.git

Better hash functions for Data.HashTable, from Jan-Willem Maessen
---

diff --git a/Data/HashTable.hs b/Data/HashTable.hs
index 391876f..34a6600 100644
--- a/Data/HashTable.hs
+++ b/Data/HashTable.hs
@@ -170,7 +170,7 @@ recordLookup = instrument lkup
 -- stats :: IO String
 -- stats = fmap show $ readIORef hashData
 
--- -----------------------------------------------------------------------------
+-- ----------------------------------------------------------------------------
 -- Sample hash functions
 
 -- $hash_functions
@@ -180,41 +180,71 @@ recordLookup = instrument lkup
 -- function therefore will give an even distribution regardless of /n/.
 --
 -- If your keyspace is integrals such that the low-order bits between
--- keys are highly variable, then you could get away with using 'id'
+-- keys are highly variable, then you could get away with using 'fromIntegral'
 -- as the hash function.
 --
 -- We provide some sample hash functions for 'Int' and 'String' below.
 
 golden :: Int32
-golden = -1640531527
+golden = 1013904242 -- = round ((sqrt 5 - 1) * 2^32) :: Int32
+-- was -1640531527 = round ((sqrt 5 - 1) * 2^31) :: Int32
+-- but that has bad mulHi properties (even adding 2^32 to get its inverse)
+-- Whereas the above works well and contains no hash duplications for
+-- [-32767..65536]
+
+hashInt32 :: Int32 -> Int32
+hashInt32 x = mulHi x golden + x
 
 -- | A sample (and useful) hash function for Int and Int32,
--- implemented by extracting the lowermost 32 bits of the
--- result of multiplying by a 32-bit constant. The constant is from
+-- implemented by extracting the uppermost 32 bits of the 64-bit
+-- result of multiplying by a 33-bit constant. The constant is from
 -- Knuth, derived from the golden ratio:
+-- > golden = round ((sqrt 5 - 1) * 2^32)
+-- We get good key uniqueness on small inputs
+-- (a problem with previous versions):
+-- (length $ group $ sort $ map hashInt [-32767..65536]) == 65536 + 32768
 --
--- > golden = round ((sqrt 5 - 1) * 2^31) :: Int
 hashInt :: Int -> Int32
-hashInt x = fromIntegral x * golden
+hashInt x = hashInt32 (fromIntegral x)
 
 -- hi 32 bits of a x-bit * 32 bit -> 64-bit multiply
 mulHi :: Int32 -> Int32 -> Int32
 mulHi a b = fromIntegral (r `shiftR` 32)
-  where r :: Int64
-        r = fromIntegral a * fromIntegral b :: Int64
+   where r :: Int64
+         r = fromIntegral a * fromIntegral b
 
 -- | A sample hash function for Strings. We keep multiplying by the
 -- golden ratio and adding. The implementation is:
 --
--- > hashString = foldl' f 0
--- >   where f m c = fromIntegral (fromEnum c + 1) * golden + mulHi m golden
+-- > hashString = foldl' f golden
+-- >   where f m c = fromIntegral (fromEnum c) * magic + hashInt32 m
+-- >         magic = 0xdeadbeef
+--
+-- Where hashInt32 works just as hashInt shown above.
+--
+-- Knuth argues that repeated multiplication by the golden ratio
+-- will minimize gaps in the hash space, and thus it's a good choice
+-- for combining together multiple keys to form one.
 --
--- Note that this has not been extensively tested for reasonability,
--- but Knuth argues that repeated multiplication by the golden ratio
--- will minimize gaps in the hash space.
+-- Here we know that individual characters c are often small, and this
+-- produces frequent collisions if we use fromEnum c alone. A
+-- particular problem are the shorter low ASCII and ISO-8859-1
+-- character strings. We pre-multiply by a magic twiddle factor to
+-- obtain a good distribution. In fact, given the following test:
+--
+-- > testp :: Int32 -> Int
+-- > testp k = (n - ) . length . group . sort . map hs . take n $ ls
+-- >   where ls = [] : [c : l | l <- ls, c <- ['\0'..'\xff']]
+-- >         hs = foldl' f golden
+-- >         f m c = fromIntegral (fromEnum c) * k + hashInt32 m
+-- >         n = 100000
+--
+-- We discover that testp magic = 0.
+
 hashString :: String -> Int32
-hashString = foldl' f 0
-  where f m c = fromIntegral (ord c + 1) * golden + mulHi m golden
+hashString = foldl' f golden
+   where f m c = fromIntegral (fromEnum c) * magic + hashInt32 m
+         magic = 0xdeadbeef
 
 -- | A prime larger than the maximum hash table size
 prime :: Int32
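
To try the new hash functions outside of Data.HashTable, the added lines of the patch can be collected into a small standalone program. This is only a sketch: the definitions of golden, mulHi, hashInt32, hashInt and hashString are copied from the patch, while distinctCount and main are hypothetical helpers introduced here to repeat the uniqueness check quoted in the hashInt comment.

```haskell
import Data.Bits (shiftR)
import Data.Int (Int32, Int64)
import Data.List (foldl', group, sort)

-- Constant taken from the patch.
golden :: Int32
golden = 1013904242

-- High 32 bits of the signed 64-bit product of two Int32 values.
mulHi :: Int32 -> Int32 -> Int32
mulHi a b = fromIntegral (r `shiftR` 32)
  where
    r :: Int64
    r = fromIntegral a * fromIntegral b

-- Adding x back in is what makes the multiplier effectively 33 bits wide.
hashInt32 :: Int32 -> Int32
hashInt32 x = mulHi x golden + x

hashInt :: Int -> Int32
hashInt x = hashInt32 (fromIntegral x)

hashString :: String -> Int32
hashString = foldl' f golden
  where
    f m c = fromIntegral (fromEnum c) * magic + hashInt32 m
    -- 0xdeadbeef does not fit in Int32; it wraps to a negative value,
    -- and GHC may emit an overflowed-literal warning here.
    magic = 0xdeadbeef

-- Hypothetical helper (not part of the patch): number of distinct hashes.
distinctCount :: Ord b => (a -> b) -> [a] -> Int
distinctCount h = length . group . sort . map h

main :: IO ()
main = do
  let keys = [-32767 .. 65536] :: [Int]
  -- Uniqueness check from the hashInt comment in the patch.
  print (distinctCount hashInt keys == length keys)
  -- A few string hashes, printed for inspection only.
  print (map hashString ["", "a", "b", "ab", "ba"])
```

Over the quoted key range, hashInt x equals x plus the floor of x * golden / 2^32 with no Int32 overflow, so it is strictly increasing there, which is consistent with the no-duplicates claim in the patch comment.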