Solr Optimizations

Shiksha Engineering
Sep 25, 2023

Authors: Sufiyan Ali, Ashar Ekram

At Shiksha, while migrating from Solr 5.x to Solr 8.x, we were getting the following type of exception:

IOException occurred when talking to server at: http://solrDNS:port

Initial findings revealed Solr GC issues that were stalling the Solr application, which in turn produced the exceptions above.

Graph showing GC metrics at the time of the exceptions
Graph showing heap usage at the time of the exceptions

First approach:

At first glance, it seemed the young-generation memory was not allocated properly, so we decided to increase it.

For this, we tried setting the property below:

  • -XX:MaxNewSize=14g

However, this resulted in Full GCs (old-generation collections), which had an even more negative impact on the performance and responsiveness of the application.

Note: Don’t hand-tune this GC parameter. Let the GC set it on its own.

Second approach:

Looking at the GC pause durations, we noticed that whenever there was a high GC pause, the application became unresponsive.

We also noticed that high GC pause times (above roughly 500 ms) occurred only 8 times, as shown in the image below.

So, we tried to cap the GC pause time by setting -XX:MaxGCPauseMillis=400.

But this didn’t work out either; we kept getting the same type of exceptions.

Third approach:

We checked the heap usage for Solr; a reference graph is attached.

At the peak of each spike there was always a G1 Humongous Allocation event, which was causing GCs to trigger very frequently.

Humongous allocations: In G1 terminology, a humongous object is one that is larger than 50% of a heap region. The G1 GC divides the heap into equally sized regions, and any object that cannot fit into half a region is termed “humongous”.

Such objects are handled specially because pushing them through the regular allocation process can cause significant heap fragmentation, which degrades performance. Instead, G1 allocates humongous objects in a contiguous set of regions called humongous regions.

Example: Suppose the JVM determines a heap region size of 2 MB (based on your total heap size). An object of 1.5 MB is more than 50% of 2 MB, so it is considered humongous, and G1 allocates it in the humongous regions.

public class HumongousExample {
    // Roughly 1.5 MB: 4 bytes per int element plus ~16 bytes of array header
    static final int SIZE = (1_500_000 - 16) / 4;

    public static void main(String[] args) {
        int[] largeArray = new int[SIZE];

        // Whether largeArray is humongous depends on the G1 region size:
        // with 2 MB regions, this 1.5 MB array exceeds the 50% threshold
        System.out.println("Allocated " + largeArray.length + " ints");
    }
}
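
On JDK 9 and later (an assumption about the runtime; the exact JVM version isn’t covered here), such allocations can be confirmed through unified GC logging (-Xlog:gc*): the pauses they trigger are reported with the cause “G1 Humongous Allocation”, the same event we saw at the peaks of our heap spikes.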

For our Solr application, we have:

  • -Xms17240m -Xmx20240m

By default, the region size is calculated during startup based on the heap size,

i.e. region size = startingHeapSize / 2048 ≈ 8 MB in our case
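
As a rough illustration of that arithmetic, here is a simplified sketch. It is not the actual HotSpot ergonomics (which, among other things, also consider the maximum heap size); it just assumes the heap is split into ~2048 regions, with the size rounded down to a power of two between 1 MB and 32 MB:

public class RegionSizeEstimate {
    // Simplified model of G1's default region sizing: target ~2048 regions,
    // round the result down to a power of two, clamp between 1 MB and 32 MB
    static long defaultRegionSize(long heapBytes) {
        long target = heapBytes / 2048;
        long size = 1L * 1024 * 1024; // 1 MB minimum
        while (size * 2 <= target && size < 32L * 1024 * 1024) {
            size *= 2;
        }
        return size;
    }

    public static void main(String[] args) {
        long xms = 17240L * 1024 * 1024; // -Xms17240m, as in our setup
        long regionSize = defaultRegionSize(xms);
        System.out.println("estimated region size ~ " + regionSize / (1024 * 1024) + " MB"); // ~8 MB
        // Any allocation larger than half a region is humongous
        System.out.println("humongous threshold ~ " + regionSize / 2 / (1024 * 1024) + " MB"); // ~4 MB
    }
}

With ~8 MB regions, any allocation above ~4 MB is humongous; even raising the region size to 16 MB (as we try below) only moves that threshold to 8 MB.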

So we tried increasing the G1 region size so that these allocations would no longer exceed the 50% limit. The region size can be overridden with the ‘-XX:G1HeapRegionSize’ property, so we set

-XX:G1HeapRegionSize=16m

Setting this property increased the region size but, at the same time, reduced the number of regions. This resulted in mixed GCs being triggered, and it didn’t fix the issue either.

At this point, we had figured out that we had to solve the humongous allocation issue itself.

We also noticed that some Solr queries took roughly the same amount of time every time they were executed, i.e. the queries were not getting cached.

Optimizing Code:

We reviewed the code and optimized it to reduce object creation and minimize unnecessary memory usage. The following optimizations were made:

  • Removed group.limit={someVeryBigNo} from a Solr query that was fetching more than 10 MB of data in a single request, and added pagination instead.
  • Replaced group.limit=-1 in Solr queries with bounded values (a SolrJ sketch follows below).
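
To illustrate, here is a minimal SolrJ sketch of such a bounded, paginated grouped query. The collection URL and the grouping field (courseId) are hypothetical placeholders, not our actual schema:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BoundedGroupQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr URL and collection name
        try (SolrClient client = new HttpSolrClient.Builder("http://solrDNS:port/solr/collection1").build()) {
            SolrQuery query = new SolrQuery("*:*");
            query.set("group", true);
            query.set("group.field", "courseId"); // hypothetical grouping field
            query.set("group.limit", 20);         // bounded limit instead of -1
            query.setStart(0);                    // page offset: number of groups to skip
            query.setRows(50);                    // page size: number of groups per page
            QueryResponse response = client.query(query);
            System.out.println(response.getGroupResponse().getValues());
        }
    }
}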

One important point we observed after removing -1 from group.limit: queries started getting cached, and after the first hit, all subsequent hits of the same Solr query responded in around 0 ms.

The heap usage graph above shows that there were no humongous allocations after the above-mentioned code optimizations.

This also improved Solr’s CPU and memory usage.

Conclusion:

  • Though there is no clear documentation on how caching behaves when group.limit is set to -1, we experienced a performance hit because of it.
  • Also, setting group.limit to -1 can return a very large number of documents per group, so it should be used with caution, especially if the Solr index contains a substantial amount of data. Be mindful of the impact on query performance and of the amount of data transferred over the network when using group.limit=-1.
