Harvard Computer Science Technical Reports for 2002

TR-01-02 [tr-01-02.ps.gz (184 K)]
Wheeler Ruml. 2002. ``Heuristic Search in Bounded-depth Trees: Best-Leaf-First Search.''
Many combinatorial optimization and constraint satisfaction problems can be formulated as a search for the best leaf in a tree of bounded depth. When exhaustive enumeration is infeasible, a rational strategy visits leaves in increasing order of predicted cost. Previous systematic algorithms for this setting follow a predetermined search order, making strong implicit assumptions about predicted cost and using problem-specific information inefficiently. We introduce a framework, best-leaf-first search (BLFS), that employs an explicit model of leaf cost. BLFS is complete and visits leaves in an order that efficiently approximates increasing predicted cost. Different algorithms can be derived by incorporating different sources of information into the cost model. We show how previous algorithms are special cases of BLFS. We also demonstrate how BLFS can derive a problem-specific model during the search itself. Empirical results on latin square completion, binary CSPs, and number partitioning problems suggest that, even with simple cost models, BLFS yields competitive or superior performance and is more robust than previous methods. BLFS can be seen as a model-based extension of iterative-deepening A*, and thus it unifies search for combinatorial optimization and constraint satisfaction with traditional AI heuristic search for shortest-path problems.
TR-02-02 [tr-02-02.ps.gz (102 K)]
Michael Mitzenmacher and Adrian Perrig. 2002. ``Bounds and Improvements for BiBa Signature Schemes.''
This paper analyzes and improves the recently proposed bins and balls (BiBa) signature, a new approach for designing signatures from one-way functions without trapdoors.

We first construct a general framework for signature schemes based on the balls and bins paradigm and propose several new related signature algorithms. The framework also allows us to obtain upper bounds on the security of such signatures. Several of our signature algorithms approach the upper bound. We then show that by changing the framework in a novel manner we can boost the efficiency and security of our signature schemes. We call the resulting mechanism Powerball signatures. Powerball signatures offer greater security and efficiency than previous signature schemes based on one-way functions without trapdoors.

TR-03-02 [tr-03-02.ps.gz (291 K)]
Jonathan Ledlie, Laura Serban, and Dafina Toncheva. 2002. ``Scaling Filename Queries in a Large-Scale Distributed File System.''
We have examined the tradeoffs in applying regular and Compressed Bloom filters to the name query problem in distributed file systems and developed and tested a novel mechanism for scaling queries as the network grows large. Filters greatly reduced query messages when using Fan's "Summary Cache" in web cache hierarchies, a similar albeit smaller, searching problem. We have implemented a testbed that models a distributed file system and run experiments that test various configurations of the system to see if Bloom filters could provide the same kind of improvements. In a realistic system, where the chance that a randomly queried node holds the file being searched for is low, we show that filters always provide lower bandwidth/search and faster time/search, as long as the rates of change of the files stored at the nodes is not extremely high relative to the number of searches. In other words, we confirm the intuition that keeping some state about the contents of the rest of the system will aid in searching as long as acquiring this state is not overly costly and it does not expire too quickly.

The grouping topology we have developed divides n nodes into log(n) groups, each of which has a representative node that aggregates a composite filter for the group. All nodes not in that group use this low-precision filter to weed out whole collections of nodes by probing these filters, only sending a search to be proxied by a member of the group if the probe of the group filter returns positively. Proxied searches are then carried out within a group, where more precise (more bits per file) filters are kept and exchanged between the n/log(n) nodes in a group. Experimental results show that both bandwidth/search and time/search are improved with this novel grouping topology.

TR-04-02 [tr-04-02.ps.gz (651 K)]
C. A. (Lex) Stein, Michael J. Tucker, and Margo I. Seltzer . 2002. ``Reliable and fault-tolerant peer-to-peer block storage.''
The Charles is a scalable, fault-tolerant, persistent block storage service built from a constantly changing network of potentially faulty or malicious computing nodes. The Charles is designed for a P2P environment: networks characterized by frequent process joins and departures, as well as arbitrary or Byzantine process behavior.

The Charles consists of two levels. The top level provides a scalable lookup service. This level is known as the Surface level and is structured as a ring of slowly changing process groups. The bottom level, known as the Submerged level, combines individual processes together into fault-tolerant groups absorbing network dynamics and process failure; whether it is arbitrary behavior or crash failure. The bottom level serves to provide the top level with predictable and well- behaved process groups. Both levels are self-organizing and self-managing. The Charles has no centralized control.

The Charles combines block lookup with storage. Fault-tolerance is provided through replication within process groups. Replication is maintained internally without any application state or callbacks.

TR-05-02 [tr-05-02.ps.gz (14435 K)]
Luke Hunsberger. 2002. ``Group Decision Making and Temporal Reasoning.''
The more capable and autonomous computer systems become, the more important it is for them to be able to act collaboratively, whether in groups consisting solely of other computers or in heterogeneous groups of computers and people. To act collaboratively requires that systems have effective group decision-making capabilities. This thesis makes four important contributions to the design of group decision-making mechanisms and algorithms for deploying them in collaborative, multi-agent systems. First, it provides an abstract framework for the specification of group decision-making mechanisms that computer agents can use to coordinate their planning activity when collaborating with other agents. Second, it specifies a combinatorial auction-based mechanism that computer agents can use to help them decide, both individually and collectively, whether to engage in a collaborative activity. Third, it extends the theory of Simple Temporal Networks by providing a rigorous theoretical analysis of an important family of temporal reasoning problems. Fourth, it provides sound, complete and polynomial-time algorithms for solving those temporal reasoning problems and specifies the use of such algorithms by agents participating in the auction-based mechanism.
TR-06-02 [tr-06-02.ps.gz (877 K)]
Dan Ellard , Jonathan Ledlie, Pia Malkani , and Margo Seltzer . 2002. ``Everything You Always Wanted to Know about NFS Trace Analysis, but Were Afraid to Ask.''
The past two decades in file system design have been driven by the sequence of trace-based file system studies that have informed the research community of the general access patterns applied to secondary storage systems. Those same two decades have witnessed a radical shift in the types of people who use computer systems and a resulting change in the workloads to which today's systems are subjected. In this paper, we continue the tradition of trace-based studies, but with some important differences. First, we use passive monitoring of NFS traffic to collect our data; we believe that the non-invasive nature of this technique will improve the community's ability to collect a wider range of workloads. Second, we focus on two different types of workloads, the traditional CS departmental load typically used in such studies and the load from a campus-wide, ISP-like environment. As might be expected, these workloads differ dramatically. The departmental workload does not follow previous predictions as directly as might have been expected and the ISP-like workload turns out to provide a detailed analysis of file-system-based mail handling. Our findings demonstrate that the set of static parameters and heuristics that most file systems provide is inadequate to adapt to complex, specialized workloads.
TR-07-02 [tr-07-02.ps.gz (107 K)]
Dan Ellard , David Holland , Nicholas Murphy, and Margo Seltzer . 2002. ``On the Design of a New CPU Architecture for Pedagogical Purposes.''
Ant-32 is a new processor architecture designed specifically to address the pedagogical needs of teaching many subjects, including assembly language programming, machine architecture, compilers, operating systems, and VLSI design. This paper discusses our motivation for creating Ant-32 and the philosophy we used to guide our design decisions and gives a high-level description of the resulting design.
TR-08-02 [tr-08-02.ps.gz (179 K)]
. 2002. ``The Ant-32 Architecture - version 3.1.0b.''
Ant-32 is a new 32-bit CPU architecture designed specifically to address the pedagogical needs of teaching many subjects, including assembly language programming, machine architecture, compilers, operating systems, and VLSI design. This document is a specification for version 3.1.0b the Ant-32 architecture.
TR-09-02 [tr-09-02.ps.gz (285 K)]
Daniel J. Ellard and Penelope A. Ellard . 2002. ``Ant-32 Assembly Language Tutorial (for version 3.1.0b).''
The Ant-32 architecture is a 32-bit RISC architecture designed specifically for pedagogical purposes. It is intended to be useful to teaching a broad variety of topics, including machine architecture, assembly language programming, compiler code generation, operating systems, and VLSI circuit design and implementation. This document contains a brief tutorial for Ant-32 assembly language programming, the assembly language utilities, and a description of the general Ant-32 instruction set architecture.
TR-10-02 [tr-10-02.ps.gz (784 K)]
Avi Pfeffer and Ya'akov Gal. 2002. ``A Language for Descriptive Decision and Game Theory .''
Traditional game-theoretic formalisms, commonly used in multi-agent systems, make the assumptions that beliefs of players are consistent and that that the model of the game is common knowledge among the players. We present Influence Diagram Networks, a language for descriptive decision and game theory that is based on graphical models. Our language relaxes the assumption traditionally used in economics that beliefs of agents are consistent, i.e. conditioned on a common prior distribution. We present an algorithm that computes the actions of agents under the assumption that they are rational with respect to their own model, but not necessarily with respect to the real world. In the single-agent case one can model situations in which the agent has an incorrect model of the way the world works, or in which a modeler has uncertainty about the agent's model. In the multi-agent case, one can model agents' uncertain beliefs about other agents' decision-making models. Applications of our language include determining the cost to an agent of using an incorrect model, opponent modeling in games, and modeling bounded rationality.
TR-11-02 [tr-11-02.ps.gz (155 K)]
Robert Fischer and Margo Seltzer. 2002. ``Clilets: Web Applications with Secure Client-Side Storage.''
Today's web applications require that all data be visible to the server. This is a problem in cases, such as a Web Tax service, where the user may not trust the server with the data. We present the Clilet system, a new web application system that allows sensitive data to be stored securely on the client yet still accessed by the web application. The system ensures that this data is not transmitted to the server, even though no trust is shared between client and server. We have built a working prototype.
TR-12-02 [tr-12-02.ps.gz (76 K)]
Frederic H. Behr, Jr., Victoria Fossum, Michael Mitzenmacher, and David Xiao. 2002. ``Estimating and Comparing Entropy across Written Natural Languages Using PPM Compression.''
Previous work on estimating the entropy of written natural language has focused primarily on English. We expand this work by considering other natural languages, including Arabic, Chinese, French, Greek, Japanese, Korean, Russian, and Spanish. We present the results of PPM compression on machine-generated and human-generated translations of texts into various languages.

Under the assumption that languages are equally expressive, and that PPM compression does well across languages, one would expect that translated documents would compress to approximately the same size. We verify this empirically on a novel corpus of translated documents. We suggest as an application of this finding using the size of compressed natural language texts as a mean of automatically testing translation quality.

TR-13-02 [tr-13-02.ps.gz (343 K)]
Nicholas Elprin and Bryan Parno . 2002. ``An Analysis of Database-Driven Mail Servers.''
This paper compares the performance of three different IMAP servers, each of which uses a different storage mechanism: Cyrus uses a database built on BerekelyDB, Courier-IMAP uses maildirs, and UW-IMAP uses mbox files. We also use a mySQL database to simulate a relational-database-driven IMAP server. We find that Cyrus and mySQL outperform UW and Courier in most tests, often dramatically beating Courier. Cyrus is particularly efficient at scan operations such as retrieving headers, and it also does particularly well on searches on header fields. UW and Cyrus perform similarly on full-text searches, although Cyrus seems to scale slightly better as the size of the mailbox grows. mySQL excels at full-text searches and header retrieval, but it performs poorly when deleting messages. In general, we believe that a database system offers better email storage facilities than traditional file systems.