Background; Adello does mobile ads using a medium size Hadoop cluster (http://www.technavio.com/blog/top-16-hadoop-technology-companies), taking in, among others, the Smaato, Flurry, OpenX, and OMAX feeds. Combined, this is a fair amount of data to handle, about a terabyte per day. Cassandra's turned out to be great for this task, even if we had to fix a race condition and develop new read/write paths from Pig as the existing ones were not able to handle the analysis load (see https://issues.apache.org/jira/browse/CASSANDRA-6788 and https://issues.apache.org/jira/browse/CASSANDRA-6665).
The standard garbage collection config for Cassandra is not bad, but it couldn't quite cope with the 20GB+ heaps we had before we really optimized our Cassandra config. In particular, too much data was pushed into the old space. This creates 30-50 second long stop-the-world GC cycles every 2-4 hours. This was a serious issue when using the token aware writing of Astyanax (https://github.com/Netflix/astyanax), a problem partly solved by optimizing fail-overs, retries, and using Hystrix (https://github.com/Netflix/hystrix) to drop the outliers. Overall, I found Netflix' Github account to be a open-pit gold mine :)
Even though we're no longer using G1, I found the experiences worth sharing. The G1 collector was running in production for a few weeks, and here are our observations:
Initially, it seemed to be a fire-and-forget change; once we switched from CMS, we've immediately saw a major reduction in the time and frequency of long GC cycles.
However, too much data was still pushed into old space. This is caused to the memory-wasteful read/write paths of Cassandra over Thrift.
We soon realized that the time spent on the full cycles increased linearly with the Cassandra's uptime. Over the course of a week, the full GC cycle went from taking 10 seconds to 60 seconds, reclaiming the same amount of memory. This is mostly caused by fragmentation, which G1 was supposed handle better than CMS. Internally, G1 arranges the space into blocks that can be reclaimed/defragmented separately. Clearly, the simplest solution was to restart Cassandra nodes on a regular basis, like Spotify does (http://www.slideshare.net/planetcassandra/8-axel-liljencrantz-23204252). However, with over 4TB per node, Cassandra takes roughly 5 minutes to start, making a serial restart a lengthy process.
To combat the fragmentation issue, we turned to the experimental options for G1. The "-XX:G1NewSizePercent" defaults to 5 and "-XX:G1MaxNewSizePercent" defaults to 60 (they're described here: http://www.oracle.com/technetwork/articles/java/g1gc-1984535.html). Given that Cassandra is wasteful with memory, most of the connection-related data is ready to be reclaimed almost instantly. By using "-XX:G1NewSizePercent=60" and "-XX:G1MaxNewSizePercent=95", we keep almost all the data off old space. When using the defaults, the old space would slowly grow to about 12GB, with the new settings, it stayed at about 1GB. Given that the young collection is more parallel, this was huge benefit both for responsiveness and max heap use.
Some notes: setting the "-XX:InitiatingHeapOccupancyPercent" doesn't seem to have much impact. It defaults to 45, but even setting it to 5 or 0 doesn't mean the GC will step in earlier. Reducing the max heap size, however, makes the GC kick in. This even made it possible to run Cassandra with less than 8GB of RAM, but the risk/benefit wasn't there for us. Also, we used "-XX:MaxGCPauseMillis=100", compared to the default of 200. However, this has little impact as G1 typically collected 7GB of new space in 50ms, which is pretty fantastic performance.
Cassandra's been great for the most part. The main problem is that it has an extremely ungraceful degradation. If you leave Cassandra's intended design space, or run with a non-optimized config, it will quickly turn into a constant source of pain. However, once you've figured out the best config for your application, it just keeps running, with minimal management and almost zero stability issues.