14th String Processing and Information Retrieval Symposium Santiago, Chile, October 2007
Minimal perfect hash functions are widely used for memory efficient storage and fast retrieval of items from static sets, such as words in natural languages, reserved words in programming languages or interactive systems, universal resource locations (URLs) in web search engines, or item sets in data mining techniques. A perfect hash function (PHF) for a key set is a function that maps the keys of to unique values. A minimal perfect hash function (MPHF) is a PHF with the smallest range, i.e., . The minimum amount of space to represent a PHF for a given set is known to be approximately bits, where . The objective of this talk is to present time efficient and near space-optimal perfect hashing algorithms. We will present: (i) a family of internal memory based algorithms that assume uniform hashing to build PHFs (for ) and MPHFs (for ) based on random graphs, and (ii) an external memory based algorithm that has experimentally proven practicality for sets in the order of billions of keys. Both internal and external algorithms have the following properties: (i) evaluation of a PHF or a MPHF requires constant time, (ii) the algorithms are simple to describe and implement, and generate the functions in linear time, (iii) the amount of space needed to represent a PHF is bits and a MPHF is bits, which is around a factor of 2 from the information theoretical minimum of approximately and bits, respectively. To our knowledge, no previously known algorithm has these properties and any algorithm in the literature with the third property either requires exponential time for construction and evaluation, or uses near-optimal space only asymptotically, for extremely large . We demonstrate the scalability of the external memory based algorithm by constructing minimum perfect hash functions for a set of 1.024 billion URLs from the World Wide Web of average length 64 characters in approximately 62 minutes, using a commodity PC.
Nivio Ziviani has a Ph.D. in Computer Science from the University of Waterloo, Canada, 1982. He is a Professor of the Department of Computer Science of the Federal University of Minas Gerais (UFMG), Brazil, where he coordinates the Laboratory for Treating Information (LATIN). He is a Professor Emeritus of the Institute of Exact Sciences of UFMG. He is a co-founder of Miner Technology Group, sold to Folha de São Paulo / UOL group in 1999, and Akwan Information Technologies, sold to Google Inc. in 2005. He has co-authored of over 100 refereed papers and 2 books in the areas of algorithm design and information retrieval, the latter his primary area of research. He was General Co-Chair of the 28th ACM SIGIR Conference on Research and Development in Information Retrieval and co-founder of the International Conference on String Processing and Information Retrieval (SPIRE).