EXPEDIA GROUP TECHNOLOGY — ENGINEERING

Upgrading Apache ZooKeeper from 3.7.0 to 3.8.0 to Get Rid of the Log4j Vulnerability

Resolving a full disk problem in the upgrade process

Photo by Clément Hélardot on Unsplash

Where are we using Apache ZooKeeper?

We use Apache Ignite for caching in one of our applications at Expedia Group™. It runs as a distributed cluster in which Ignite nodes discover each other using ZooKeeper. ZooKeeper Discovery is designed for massive deployments that need ease of scalability and performance when the number of Ignite nodes grows beyond 100.

We run ZooKeeper as a Docker container on EC2 machines.

Source: Ignite documentation

Please find more details about Apache Ignite usage in our application here.

Why upgrade to ZooKeeper version 3.8.0 from 3.7.0?

Due to the recent Log4j vulnerabilities, ZooKeeper migrated to another mature logging library called Logback. This migration was part of the ZooKeeper 3.8.0 release. So, to get rid of the Log4j vulnerabilities in our application, we decided to upgrade from version 3.7.0 to 3.8.0.

We upgraded ZooKeeper and the cluster nodes went down. The EC2 machine on which the ZooKeeper container was running had run out of disk space, which puzzled us at first.

What was the root cause for the disk out-of-space issue?

By default, ZooKeeper’s Logback configuration uses the console appender. So all logs about Ignite nodes joining the ZooKeeper cluster, leaving it, and so on were written to standard output. Since we were running ZooKeeper as a Docker container, Docker’s logging driver captured the container’s standard output and standard error and wrote them to files in JSON format. By default, Docker’s logging driver performs no log rotation. As a result, the log files written by the default JSON-file logging driver kept growing until they exhausted the disk space on the EC2 machine where ZooKeeper was running.
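A quick way to check for this kind of growth on a host is to look at the size of the container log files. The sketch below assumes Docker’s default data root on Linux, /var/lib/docker/containers, where the JSON-file driver writes one <container-id>-json.log file per container. The function name largest_container_logs and the DOCKER_LOG_ROOT variable are hypothetical names used for illustration, and reading that directory usually requires root.

```shell
# Print the five largest json-file container logs under a directory (sizes in KB).
largest_container_logs() {
  find "$1" -name '*-json.log' -exec du -k {} + 2>/dev/null | sort -rn | head -n 5
}

# Assumption: Docker's default data root on Linux; override DOCKER_LOG_ROOT otherwise.
largest_container_logs "${DOCKER_LOG_ROOT:-/var/lib/docker/containers}"
```

A single log file that accounts for most of the used disk space points to missing log rotation rather than to application data.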

With no disk space left, the ZooKeeper container was killed, which brought down the whole ZooKeeper cluster. Since the Ignite nodes relied on ZooKeeper to discover each other, they could no longer do so once the ZooKeeper cluster was down. Hence the whole caching infrastructure went down.

How did we resolve the disk out-of-space issue?

1. We used logging options provided by Docker’s JSON-file logging driver.

Docker by default uses the JSON-file logging driver for file-based storage of logs. By default, it provides no log rotation. But it does have some logging options to support log rotation which need to be explicitly passed as parameters while running a Docker container.

  • max-size: The maximum size of the log before it is rolled. A positive integer plus a modifier representing the unit of measure (k, m, or g). Defaults to -1 (unlimited).
  • max-file: The maximum number of log files that can be present. If rolling the logs creates excess files, the oldest file is removed. Only effective when max-size is also set. A positive integer. Defaults to 1.
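A minimal sketch of such a command is shown here, using a limit of 3 log files of 10 megabytes each. The container name zookeeper, the image tag zookeeper:3.8.0, and the RUN_ZK switch are assumptions made for illustration; only the --log-driver and --log-opt flags come from Docker’s JSON-file logging driver.

```shell
# Log-rotation options for Docker's json-file driver: at most 3 files of 10 MB each.
LOG_OPTS="--log-driver json-file --log-opt max-size=10m --log-opt max-file=3"

# Assumed container name and image tag; adjust both to your deployment.
ZK_CMD="docker run -d --name zookeeper $LOG_OPTS zookeeper:3.8.0"

# Print the command for review; set RUN_ZK=1 to start the container on a Docker host.
echo "$ZK_CMD"
if [ "${RUN_ZK:-0}" = "1" ]; then
  $ZK_CMD
fi
```

These options apply only to containers created with them, so an already running container has to be recreated to pick them up.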

With max-size=10m and max-file=3, for example, the ZooKeeper container keeps at most 3 log files of no more than 10 megabytes each.

2. We configured Logback with RollingFileAppender.

Since ZooKeeper 3.8.0 uses Logback for logging, we can configure it to use RollingFileAppender. RollingFileAppender appends log events to a file and can roll over (archive the current log file and resume logging in a new file) on a schedule, such as daily, weekly, or monthly, or based on log file size. In our scenario, we needed a rollover policy based on log size.
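A size-based rollover of this kind could look like the sketch below, which writes a logback.xml with a 2 MB size trigger and a fixed window of 20 rolled-over files. It is an illustration under assumptions, not our exact configuration: the output path, the appender name ROLLINGFILE, the log file name zookeeper.log, the log pattern, and the fallback to the current directory when the zookeeper.log.dir property is not set are all placeholders.

```shell
# Assumed output path; ZooKeeper reads logback.xml from its conf directory.
LOGBACK_FILE="${LOGBACK_FILE:-./logback.xml}"

# The quoted 'EOF' keeps the shell from expanding Logback's ${...} variables.
cat > "$LOGBACK_FILE" <<'EOF'
<configuration>
  <appender name="ROLLINGFILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>${zookeeper.log.dir:-.}/zookeeper.log</file>
    <encoder>
      <pattern>%d{ISO8601} %-5level [%thread] %logger{36} - %msg%n</pattern>
    </encoder>
    <!-- Keep rolled-over files zookeeper.log.1 to zookeeper.log.20; the oldest is dropped. -->
    <rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
      <fileNamePattern>${zookeeper.log.dir:-.}/zookeeper.log.%i</fileNamePattern>
      <minIndex>1</minIndex>
      <maxIndex>20</maxIndex>
    </rollingPolicy>
    <!-- Roll over once the active log file reaches 2 MB. -->
    <triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
      <maxFileSize>2MB</maxFileSize>
    </triggeringPolicy>
  </appender>
  <root level="INFO">
    <appender-ref ref="ROLLINGFILE" />
  </root>
</configuration>
EOF
```

The generated file would then replace the logback.xml in ZooKeeper’s conf directory; where that directory lives depends on how the image is built. Because the root logger no longer references a console appender, the logs stop flowing to Docker’s logging driver.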

With a maximum log size of 2 MB and a max backup index of 20, Logback rolls the log over each time it reaches 2 MB and keeps at most 20 rolled-over log files. Once that limit is exceeded, the oldest log files are deleted.

Wrapping up

After applying a rollover policy that limits both the size and the number of log files, we were able to keep the disk space consumed by logs under control, which brought the system back to a reliable state.

Coauthored with Rohit Goel

I hope this has been useful to you all. Thanks for reading.

Learn more about technology at Expedia Group
