Harvard Computer Science Technical Reports for 2000

TR-01-00 [tr-01-00.ps.gz (54 K)]
Carl Bosley and Michael O. Rabin. 2000. ``Selecting Closest Vectors Through Randomization.''
We consider the problem of finding the closest vectors to a given vector in a large set of vectors, and propose a randomized solution. The method has applications in Automatic Target Recognition (ATR), Web Information Retrieval, and Data Mining.
TR-02-00
Chris Stein. 2000. ``The Write-Ahead File System: Integrating Kernel and Application Logging.''
TR-03-00 [tr-03-00.ps.gz (117 K)]
Michael Mitzenmacher and Andrei Broder. 2000. ``Using Multiple Hash Functions to Improve IP Lookups.''
High performance Internet routers require a mechanism for very efficient IP address look-ups. Some techniques used to this end, such as binary search on levels, need to quickly construct a good hash table for the appropriate IP prefixes. In this paper we describe an approach for obtaining good hash tables based on using multiple hashes of each input key (which is an IP address). The methods we describe are fast, simple, scalable, parallelizable, and flexible. In particular, in instances where the goal is to have one hash bucket fit into a cache line, using multiple hashes proves extremely suitable. We provide a general analysis of this hashing technique and specifically discuss its application to binary search on levels with prefix expansion.
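To illustrate the general idea of multiple-choice hashing (a minimal sketch, not the paper's exact construction; the class name, parameters, and hash derivation below are invented for illustration):

```python
import hashlib

def _hash(key: str, seed: int, num_buckets: int) -> int:
    # One of d hash functions, derived by seeding a digest (an
    # illustrative choice; any family of independent hashes works).
    h = hashlib.sha256(f"{seed}:{key}".encode()).hexdigest()
    return int(h, 16) % num_buckets

class MultiHashTable:
    """d-way multiple-choice hash table: each key may live in any of
    d candidate buckets; insertion picks the least-loaded candidate."""

    def __init__(self, num_buckets: int = 64, d: int = 2):
        self.num_buckets = num_buckets
        self.d = d
        self.buckets = [[] for _ in range(num_buckets)]

    def _candidates(self, key: str):
        return [_hash(key, seed, self.num_buckets) for seed in range(self.d)]

    def insert(self, key: str, value) -> None:
        # Choosing the shortest candidate bucket keeps the maximum bucket
        # size small with high probability, which is what lets a bucket
        # fit into a cache line.
        idx = min(self._candidates(key), key=lambda i: len(self.buckets[i]))
        self.buckets[idx].append((key, value))

    def lookup(self, key: str):
        # A lookup probes only the d candidate buckets.
        for i in self._candidates(key):
            for k, v in self.buckets[i]:
                if k == key:
                    return v
        return None
```

The point of the extra hash at insertion time is the sharply reduced maximum bucket load; lookups pay only d probes in exchange.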
TR-04-00 [tr-04-00.ps.gz (133 K)]
Rocco Servedio. 2000. ``Quantum versus Classical Learnability.''
This paper studies fundamental questions in computational learning theory from a quantum computation perspective. We consider quantum versions of two well-studied classical learning models: Angluin's model of exact learning from membership queries and Valiant's Probably Approximately Correct (PAC) model of learning from random examples. We give positive and negative results for quantum versus classical learnability. For each of the two learning models described above, we show that any concept class is information-theoretically learnable from polynomially many quantum examples if and only if it is information-theoretically learnable from polynomially many classical examples. In contrast to this information-theoretic equivalence between quantum and classical learnability, though, we observe that a separation does exist between {\em efficient} quantum and classical learnability. For both the model of exact learning from membership queries and the PAC model, we show that under a widely held computational hardness assumption for classical computation (the intractability of factoring), there is a concept class which is polynomial-time learnable in the quantum version but not in the classical version of the model.
TR-05-00 [tr-05-00.ps.gz (53 K)]
Robert Fischer. 2000. ``Multi-Domain Sandboxing: An Overview.''
In today's computing world, computer code is most often developed on one computer and run on another. Code is increasingly downloaded and run on a casual basis, as the line between code and data is blurred and executable code is found in web pages, spreadsheets, word processor documents, etc.

Lacking the knowledge or resources to verify that such code is free of malicious intent, the user must rely on hearsay and technological solutions to ensure that casually downloaded code does not damage the user's computer or steal data.

Building on the past concepts of sandboxing and multi-level security, we propose multi-domain sandboxing. This security system allows programs more flexibility than traditional sandboxing, while still preventing them from taking malicious actions. We propose applications of this new technology to the web, increasing the functionality and security possible in web applications.

TR-06-00 [tr-06-00.ps.gz (128 K)]
Michael Mitzenmacher and Sean Owen. 2000. ``Estimating Resemblance of MIDI Documents.''
Search engines often employ techniques for determining syntactic similarity of Web pages. Such a tool allows them to avoid returning multiple copies of essentially the same page when a user makes a query. Here we describe our experience extending these techniques to MIDI music files. The techniques require modification to cope with problems introduced in the musical setting, such as polyphony. Our experience suggests that when used properly these techniques prove useful for determining duplicates and clustering databases in the musical setting as well.
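The syntactic-similarity techniques referred to are shingle-based resemblance measures. A minimal sketch of the core idea, applied to a MIDI-like sequence of pitch numbers (representing a MIDI file as a flat pitch sequence is an assumption made here for illustration, and sidesteps the polyphony issues the paper addresses):

```python
def shingles(events, w=4):
    """All contiguous w-grams (shingles) of an event sequence."""
    return {tuple(events[i:i + w]) for i in range(len(events) - w + 1)}

def resemblance(a, b, w=4):
    """Resemblance = Jaccard coefficient of the two shingle sets:
    |A intersect B| / |A union B|.  Equal sequences score 1.0;
    sequences sharing no w-gram score 0.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

In practice one estimates this coefficient from a small min-hash sketch of each shingle set rather than intersecting the full sets, which is what makes the approach scale to large databases.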
TR-07-00 [tr-07-00.ps.gz (70 K)]
Michael Mitzenmacher. 2000. ``On the Hardness of Finding Optimal Multiple Preset Dictionaries.''
Preset dictionaries for Huffman codes are used effectively in fax transmission and JPEG encoding. A natural extension is to allow multiple preset dictionaries instead of just one. We show, however, that finding optimal multiple preset dictionaries for Huffman and LZ77-based compression schemes is NP-hard.
TR-08-00 [tr-08-00.ps.gz (130 K)]
Micah Adler and Michael Mitzenmacher. 2000. ``Towards Compressing Web Graphs.''
In this paper, we consider the problem of compressing graphs of the link structure of the World Wide Web. We provide efficient algorithms for such compression that are motivated by recently proposed random graph models for describing the Web. The algorithms are based on reducing the compression problem to the problem of finding a minimum spanning tree in a directed graph related to the original link graph. The performance of the algorithms on graphs generated by the random graph models suggests that by taking advantage of the link structure of the Web, one may achieve significantly better compression than natural Huffman-based schemes. We also provide hardness results demonstrating limitations on natural extensions of our approach.