I am faced with an interesting problem. I need to store timestamped logs in Cassandra. In other words, I need to store entries like this in a column family:
| Timestamp | User | Message |
|---|---|---|
| 123456678 | SomeUser | SomeMessage |
Not being an expert on Cassandra, I was naturally inclined to try this setup first: the row key is the timestamp, and user and message are columns. Recording the data worked fine, but querying it did not: range queries by timestamp were impossible, and rows came back in a seemingly random order.
After some research I realized that my Cassandra setup uses RandomPartitioner, which distributes rows evenly across the cluster by the MD5 hash of the row key. The Cassandra manual says "when in doubt, this is the best option." This works great for load balancing, but because rows are placed by hash rather than by key, range slice queries over row keys become impossible.
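To see why the row order looked random, here is a minimal sketch in plain Python. It does not talk to Cassandra; `md5_token` is a hypothetical helper that only approximates hash-based placement by treating the MD5 digest of a key as an integer, and the sample keys are made up.

```python
import hashlib


def md5_token(row_key: str) -> int:
    """Approximate a hash-partitioner token: the key's MD5 digest as an integer."""
    return int.from_bytes(hashlib.md5(row_key.encode("utf-8")).digest(), "big")


# Twenty consecutive timestamps used as row keys (hypothetical sample data).
keys = [str(123456678 + i) for i in range(20)]

# With hash-based placement, rows follow token order rather than key order.
by_token = sorted(keys, key=md5_token)

for key in by_token[:5]:
    print(key, md5_token(key))
```

Consecutive timestamps end up at unrelated tokens, so walking the rows in token order says nothing about their order in time.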
I pondered this and decided that for my purposes load balancing is more important than log ordering, so I came up with a different solution. Here, the logs are stored in a super column family with the date as the row key. Each super column is named by a timestamp in milliseconds from the start of the day (to save disk space), and its sub columns are user and message. This works well for what I need at the moment and, on the plus side, partitions logs by date:
| Date (row key) | Log Entries (super columns) |
|---|---|
| 2011-12-22 | One super column per log entry, named by milliseconds from the start of the day, with sub columns user and message |
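The layout can be sketched in plain Python, with nested dictionaries standing in for the super column family. This is only an illustration of the data model, not Cassandra client code: `split_timestamp`, `record`, and `slice_day` are hypothetical names, and I assume timestamps are epoch milliseconds and days are split on UTC boundaries.

```python
from collections import defaultdict
from datetime import datetime, timezone

MS_PER_DAY = 24 * 60 * 60 * 1000

# In-memory stand-in for the super column family:
# row key (date) -> {super column (ms offset) -> sub columns}.
logs = defaultdict(dict)


def split_timestamp(epoch_ms):
    """Return (date row key, milliseconds since the start of that UTC day)."""
    day, offset = divmod(epoch_ms, MS_PER_DAY)
    date_key = datetime.fromtimestamp(day * 86400, tz=timezone.utc).strftime("%Y-%m-%d")
    return date_key, offset


def record(epoch_ms, user, message):
    """Store one log entry under its date row and millisecond offset."""
    date_key, offset = split_timestamp(epoch_ms)
    logs[date_key][offset] = {"user": user, "message": message}


def slice_day(date_key, start_ms, end_ms):
    """Return one day's entries with offset in [start_ms, end_ms], in time order."""
    row = logs.get(date_key, {})
    return [(off, row[off]) for off in sorted(row) if start_ms <= off <= end_ms]


midnight = int(datetime(2011, 12, 22, tzinfo=timezone.utc).timestamp() * 1000)
record(midnight + 1500, "SomeUser", "SomeMessage")
record(midnight + 500, "SomeUser", "AnotherMessage")
record(midnight + MS_PER_DAY + 10, "SomeUser", "NextDayMessage")

print(slice_day("2011-12-22", 0, 1000))
```

Within a single date row the entries can be sliced by time, which is the ordering I lost at the row level. Note that in this sketch two entries in the same millisecond would overwrite each other.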
Link: Cassandra: RandomPartitioner vs OrderPreservingPartitioner