SPIRE 2007
14th String Processing and Information Retrieval Symposium Santiago, Chile, October 2007
Minimal perfect hash functions are widely used for memory efficient storage
and fast retrieval of items from static sets, such as words in natural languages,
reserved words in programming languages or interactive systems, universal
resource locations (URLs) in web search engines, or item sets in data mining
techniques. A perfect hash function (PHF)
for a key set
is a function that maps the keys of
to unique values. A minimal perfect hash function (MPHF) is a PHF
with the smallest range, i.e.,
.
The minimum amount of space to represent a PHF for a given set
is known to be approximately
bits,
where
. The objective of this talk is to present time efficient
and near space-optimal perfect hashing algorithms. We will present: (i) a family of internal memory based algorithms that assume uniform hashing
to build PHFs (for
) and MPHFs
(for
) based on random graphs, and (ii) an external memory based algorithm
that has experimentally proven practicality for sets in the order of billions
of keys. Both internal and external algorithms have the following properties:
(i) evaluation of a PHF or a MPHF requires constant time, (ii) the algorithms
are simple to describe and implement, and generate the functions in linear
time, (iii) the amount of space needed to represent a PHF is
bits and a MPHF
is
bits, which is around a factor of 2 from the information theoretical
minimum of approximately
and
bits, respectively. To our knowledge, no previously known
algorithm has these properties and any algorithm in the literature with
the third property either requires exponential time for construction and
evaluation, or uses near-optimal space only asymptotically, for extremely
large
.
We demonstrate the scalability of the external memory based algorithm by constructing minimum perfect hash functions for a set of 1.024 billion URLs from the World Wide Web of average length 64 characters in approximately 62 minutes, using a commodity PC.
Nivio Ziviani has a Ph.D. in Computer Science from the University of Waterloo, Canada, 1982. He is a Professor of the Department of Computer Science of the Federal University of Minas Gerais (UFMG), Brazil, where he coordinates the Laboratory for Treating Information (LATIN). He is a Professor Emeritus of the Institute of Exact Sciences of UFMG. He is a co-founder of Miner Technology Group, sold to Folha de São Paulo / UOL group in 1999, and Akwan Information Technologies, sold to Google Inc. in 2005. He has co-authored of over 100 refereed papers and 2 books in the areas of algorithm design and information retrieval, the latter his primary area of research. He was General Co-Chair of the 28th ACM SIGIR Conference on Research and Development in Information Retrieval and co-founder of the International Conference on String Processing and Information Retrieval (SPIRE).