Using G1 for Big Data Streams to Cassandra

Background: Adello does mobile ads using a medium-sized Hadoop cluster, taking in, among others, the Smaato, Flurry, OpenX, and OMAX feeds. Combined, this is a fair amount of data to handle, about a terabyte per day. Cassandra has turned out to be great for this task, even if we had to fix a race condition and develop new read/write paths from Pig, as the existing ones were not able to handle the analysis load.

The standard garbage collection config for Cassandra is not bad, but it couldn't quite cope with the 20GB+ heaps we had before we really optimized our Cassandra config. In particular, too much data was pushed into the old space. This created 30-50 second long stop-the-world GC cycles every 2-4 hours. This was a serious issue when using the token-aware writing of Astyanax, a problem partly solved by optimizing fail-overs and retries, and by using Hystrix to drop the outliers. Overall, I found Netflix's GitHub account to be an open-pit gold mine :)

Even though we're no longer using G1, I found the experiences worth sharing. The G1 collector was running in production for a few weeks, and here are our observations:

Initially, it seemed to be a fire-and-forget change: once we switched from CMS, we immediately saw a major reduction in the duration and frequency of long GC cycles.

However, too much data was still pushed into old space. This is caused by the memory-wasteful read/write paths of Cassandra over Thrift.

We soon realized that the time spent on full cycles increased linearly with Cassandra's uptime. Over the course of a week, the full GC cycle went from taking 10 seconds to 60 seconds while reclaiming the same amount of memory. This is mostly caused by fragmentation, which G1 was supposed to handle better than CMS. Internally, G1 arranges the space into blocks (called regions) that can be reclaimed/defragmented separately. Clearly, the simplest solution was to restart Cassandra nodes on a regular basis, like Spotify does. However, with over 4TB per node, Cassandra takes roughly 5 minutes to start, making a serial restart a lengthy process.
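One way to watch this kind of pause growth is through the JVM's own GC log. The fragment below is only a sketch: it assumes a Java 7/8 JVM (these flags were replaced by "-Xlog:gc*" in Java 9 and later), that JVM options are appended to the JVM_OPTS variable in cassandra-env.sh, and a log path that is purely illustrative.

```shell
# Sketch: GC logging flags for a Java 7/8 JVM, appended to JVM_OPTS
# (assumed here to be set in cassandra-env.sh).
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
# The log path is an assumption; use whatever location fits your nodes.
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"

# Later, on a running node, full collections can be listed with:
#   grep "Full GC" /var/log/cassandra/gc.log
```

Comparing the durations on those lines over a few days of uptime should show whether full cycles are getting slower for the same amount of reclaimed memory.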

To combat the fragmentation issue, we turned to the experimental options for G1 (they require "-XX:+UnlockExperimentalVMOptions"). "-XX:G1NewSizePercent" defaults to 5 and "-XX:G1MaxNewSizePercent" defaults to 60. Given that Cassandra is wasteful with memory, most of the connection-related data is ready to be reclaimed almost instantly. By using "-XX:G1NewSizePercent=60" and "-XX:G1MaxNewSizePercent=95", we kept almost all the data out of old space. With the defaults, the old space would slowly grow to about 12GB; with the new settings, it stayed at about 1GB. Given that the young collection is more parallel, this was a huge benefit both for responsiveness and max heap use.
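Put together, the young-generation settings above might look like the following. This is a minimal sketch, assuming the options are appended to JVM_OPTS in cassandra-env.sh; the stock script of that era also passes CMS flags and a fixed "-Xmn" young size, which would presumably have to be removed for these settings to take effect.

```shell
# Sketch: G1 with an enlarged young generation, appended to JVM_OPTS
# (assumed here to be set in cassandra-env.sh).
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
# The two NewSizePercent flags are experimental, so unlock them first.
JVM_OPTS="$JVM_OPTS -XX:+UnlockExperimentalVMOptions"
JVM_OPTS="$JVM_OPTS -XX:G1NewSizePercent=60"      # default is 5
JVM_OPTS="$JVM_OPTS -XX:G1MaxNewSizePercent=95"   # default is 60
```

The unlock flag is placed before the experimental ones because the JVM reads its options in order.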

Some notes: setting "-XX:InitiatingHeapOccupancyPercent" doesn't seem to have much impact. It defaults to 45, but even setting it to 5 or 0 doesn't mean the GC will step in earlier. Reducing the max heap size, however, makes the GC kick in. This even made it possible to run Cassandra with less than 8GB of RAM, but the risk/benefit wasn't there for us. Also, we used "-XX:MaxGCPauseMillis=100", compared to the default of 200. However, this had little impact, as G1 typically collected 7GB of new space in 50ms, which is pretty fantastic performance.
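For reference, the two tuning knobs from these notes could be written as below. Again this is only a sketch under the assumption that options are appended to JVM_OPTS in cassandra-env.sh; the occupancy value shown is simply the default, since lowering it made little difference for us.

```shell
# Sketch: pause-time goal and marking threshold, appended to JVM_OPTS
# (assumed here to be set in cassandra-env.sh).
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=100"               # default is 200
JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=45"  # default, left unchanged
```

Note that the pause target is a goal G1 tries to meet, not a hard limit.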

Cassandra's been great for the most part. The main problem is that it degrades extremely ungracefully. If you leave Cassandra's intended design space, or run with a non-optimized config, it will quickly turn into a constant source of pain. However, once you've figured out the best config for your application, it just keeps running, with minimal management and almost zero stability issues.


