2009-08-19

Open Source Bigtable Analogs

Actually, there's no fully comparable analog to Google's Bigtable - all opensource projects are weaker from performance and stability standpoints. But apparently they are already feasible for managing some unstructured datasets, provided no mission-critical data is involved.

Bellow follows the best 3-minute introduction to the whole concept I've seen online, and then you see some brief product descriptions.
The hardest part about learning HBase (the open source implementation of Google's BigTable), is just wrapping your mind around the concept of what it actually is.

I find it rather unfortunate that these two great systems contain the words table and base in their names, which tend to cause confusion among RDBMS indoctrinated individuals (like myself).

This article aims to describe these distributed data storage systems from a conceptual standpoint. After reading it, you should be better able to make an educated decision regarding when you might want to use HBase vs when you'd be better off with a "traditional" database.


Bigtable

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving).

Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.


HBase

HBase uses a data model very similar to that of Bigtable. Applications store data rows in labeled tables. A data row has a sortable row key and an arbitrary number of columns. The table is stored sparsely, so that rows in the same table can have widely varying numbers of columns.

A column name has the form "<family>:<label>" where <family> and <label> can be arbitrary byte arrays. A table enforces its set of <family>s (called "column families"). Adjusting the set of families is done by performing administrative operations on the table. However, new <label>s can be used in any write operation without pre-announcing it. HBase stores column families physically close on disk, so the items in a given column family should have roughly the same read/write characteristics and contain similar data.

Only a single row at a time may be locked by default. Row writes are always atomic, but it is also possible to lock a single row and perform both read and write operations on that row atomically.


Hypertable

The Hypertable data model consists of a multi-dimensional table of information that can be queried using a single primary key. The first dimension of the table is the row key. The row key is the primary key and defines the order in which the table data is physically stored. The second dimension is the column family. This dimension is somewhat analogous to a traditional database column. The third dimension is the column qualifier. Within each column family, there can be a theoretically infinite number of qualified instances.

For example if we were building a URL tagging service, we might define column families content, url, and tag. Within the "tag" column family there could be an infinite number of qualified instances, such as tag:science, tag:theater, tag:good, etc. The fourth and final dimension is the time dimension. This dimension consists of a timestamp that is usually auto assigned by the system and represents the insertion time of the cell in nanoseconds since the epoch. Conceptually, a table in Hypertable can be thought of as a three dimensional Excel spreadsheet with timestamped versions of each cell.


Cassandra

Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed systems technologies from Dynamo and the data model from Google's BigTable. Like Dynamo, Cassandra is eventually consistent. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems.

Cassandra was open sourced by Facebook in 2008, where it was designed by one of the authors of Amazon's Dynamo. In a lot of ways you can think of Cassandra as Dynamo 2.0. Cassandra is in production use at Facebook but is still under heavy development.

2009-08-13

DLA model parameter space

Here is an image showing how DLA-based fractal model changes when couple of density parameters are slightly tilted. The whole thing is here to generate a fractal probability distribution for geographical Yook/Jeong/Barabasi network model.
Maybe IFS-based fractal would be better here - as it might allow for more control of fractal dimension than DLA does. But then I would not get an obvious way to control exponential density fall-off, and IFS seems to be less natural than this DLA.
Also, here're some resources on fractal generation I'd stumbled upon recently: