Server apparently "sleeping"

We set up an on-demand document composition server which is controlled through OL-Connect APIs only.

The server seems to be well tuned: during stress tests with multiple robots requesting multiple documents, the 90th percentile of generation time falls between 2 and 8 seconds per document, which is very acceptable.
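For anyone wanting to reproduce this kind of statistic, here is a minimal sketch of a nearest-rank percentile in Python. The function name and the sample latencies are made up for illustration; they are not the measured values from the stress tests.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical per-document latencies in seconds (illustrative only).
latencies = [2.1, 3.4, 2.8, 7.9, 4.2, 3.0, 5.5, 2.6, 6.1, 30.0]
p90 = percentile(latencies, 90)
print(f"90th percentile: {p90} s, slowest: {max(latencies)} s")
```

With these made-up samples the 90th percentile is 7.9 seconds even though one request took 30 seconds, so a percentile on its own can hide occasional slow requests.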

However, I am noticing that some requests occasionally take much longer, around 20 to 30 seconds, while replaying the same request at a later time results in a latency of 3 to 5 seconds.

Given that the same request takes 5 seconds one time and 30 seconds another, I suspect the Server goes into a “sleeping” state.
Metrics for each step (datamining, content creation, job, output) show that the whole process takes longer, as if each engine isn’t in a ready state.

Is this behaviour documented? Any tips on how to further troubleshoot the issue or optimizations on the Server settings?

Server version: 2024.1

There is no sleeping state to speak of, unless the underlying operating system forces one.

The two obvious suspects are the Java garbage collector and the Cleanup Service.

The Java garbage collector runs periodically, at times of its own choosing and without intervention. While it is running, the system may experience severe delays. In most cases it should be nearly unnoticeable, but perhaps there is a usage pattern, such as a high volume of requests, that worsens the condition.

The Cleanup Service is a facility of Connect that periodically cleans up the back-end database by removing entities (data mapping records, intermediate content items, etc.) which are no longer in use. Its schedule is configured in the Server Configuration application. While it is running, the database can come under heavy load, which makes the overall system slower. Entries about the Cleanup Service being run should be visible in the main Server log, and it should happen at predictable times (e.g. every 8 hours).

I had a situation where we restarted the engines every day at 00:01, and the first Template to go through would take more than 30 seconds to execute. Since they are often triggered by an HTTP request, it would end up as a time-out from the user's perspective.

I had to set up a process that runs every day at 1:15 am and generates a PDF. That did it for me.

The current clean-up service is configured as follows. Should it be set to a higher interval?

Anyway, as of now the clean-up should run at 14:00 every day, but the high latency happens at any time of the day, as long as a few hours have passed since the last request. I am trying to understand how much idle time is enough to cause a lag spike.

As of now there are at most 20 requests per day since we are still in a staging phase, so for now usage patterns can be excluded. However, synthetic stress tests with robots show that the server handles hundreds of requests well (that’s where I built the latency statistics).


That could work for me too, as a workaround, but I need to better understand after how much time a lag spike occurs.
I’ll make a program to run a request at increasingly longer time delays. Hopefully at some point the lag spike will show up and I will know how often to run a fake request.
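A rough sketch of such a probe in Python, assuming the spike depends only on how long the server has been idle (which may turn out to be wrong). The names `probe_idle_gaps`, `first_spike_gap` and `send_request` are hypothetical; `send_request` stands for whatever function performs one complete document request against the server and returns when the output is ready.

```python
import time

def probe_idle_gaps(send_request, gaps_seconds, threshold_seconds,
                    sleep=time.sleep, clock=time.monotonic):
    """Send one request after each idle gap and record its latency.

    send_request: caller-supplied function (hypothetical) that performs one
    full request and blocks until it completes.
    Returns a list of (gap, latency, is_spike) tuples.
    """
    results = []
    send_request()  # prime the server so the first gap starts from a warm state
    for gap in gaps_seconds:
        sleep(gap)                      # leave the server idle for this long
        start = clock()
        send_request()
        latency = clock() - start
        results.append((gap, latency, latency > threshold_seconds))
    return results

def first_spike_gap(results):
    """Return the shortest probed gap that produced a spike, or None."""
    for gap, _latency, is_spike in results:
        if is_spike:
            return gap
    return None
```

For example, `probe_idle_gaps(my_request, [600, 1800, 3600, 7200, 14400], 15.0)` would wait 10 minutes, 30 minutes, 1 hour and so on between requests and flag any run slower than 15 seconds (`my_request` and the 15-second threshold are placeholders). The `sleep` and `clock` parameters exist only so the logic can be exercised without actually waiting.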

So the Cleanup Service schedule is not involved. It was worth a shot.

Might there be some correlation with the Engine automatic restarts?

  • I see it is configured every 60 minutes. Does that mean at minute 00 of every hour? If not, when does the countdown start?
    Side note: the lag spikes occurred at any minute, not at 00 of every hour. Maybe the 60-minute countdown starts from the last engine activity? That would stagger the lag spikes at different times.

  • Why would I want to use a rolling restart?

Current configuration

New testing configuration

I read the relevant documentation but I feel like it’s missing some examples, or at least the reasoning to apply when choosing the correct settings.

https://help.uplandsoftware.com/objectiflune/en/olconnect/2024.1/ServerConfig/AutomaticRestart_preferences.html?Highlight=automatic%20restart


Just to be safe I moved the cron to 1 o’clock at night.

I believe you may have found it.

Connect is written in Java. As you may know, a Java program is delivered in an intermediate language, which a Just-In-Time compiler (a.k.a. JIT) compiles on the fly (i.e. while the program is running, not at startup or installation time) for the native hardware on which it is installed. Until a given part of the program has been compiled, it runs as an interpreted language, which incurs an overhead. This is why Java programs get faster over time: the JIT improves the in-memory program as it executes.

The restart wipes out the JIT-compiled code from memory when the instance is shut down, and the JIT process has to start over when the new instance starts. During this time, the interpreted bytecode runs instead of the compiled code, hence the slower runs at first, which become fast over time.
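If you keep frequent restarts, one way to hide the slow first runs from real users is to replay a representative request right after the restart until it is fast again. A minimal sketch in Python, under the assumption that repeating the same request is enough to warm up the relevant code paths; `warm_up` and `send_request` are hypothetical names, with `send_request` standing for your own function that performs one complete request.

```python
import time

def warm_up(send_request, target_seconds, max_attempts=10, clock=time.monotonic):
    """Repeat a request until one run finishes within target_seconds.

    Returns the measured latencies in order. The last entry is the first run
    that met the target, or the final attempt if none did.
    """
    latencies = []
    for _ in range(max_attempts):
        start = clock()
        send_request()                  # hypothetical: one full document request
        latencies.append(clock() - start)
        if latencies[-1] <= target_seconds:
            break                       # the engines now respond at normal speed
    return latencies
```

The returned list also tells you how many runs it takes before latency settles, which is useful when deciding whether a single warm-up request is enough.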

The restart mechanism was introduced when Connect was much less mature than it is today. It improved the general robustness of the system, e.g. by making sure that all memory is periodically reclaimed.

You can probably get away with a much longer restart interval, perhaps even scheduling the restart daily during the night. At the very least, you should increase the restart period from 60 minutes. I believe that the default is 480 minutes; not sure why you have 60.

Note: The restart after (minutes) setting refers to the time elapsed since the instance startup, not a time within the hour. Rolling restart means that instances are restarted one after the other instead of all at once. This is usually preferable as it avoids downtime.
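To illustrate why the spikes do not line up with minute 00, here is a small Python sketch that projects restart times from an instance's startup time. It is a simplification: it assumes the restart itself takes no time and that the new instance's countdown begins immediately, so real times will drift later. The function name and the 09:17 startup time are made up for the example.

```python
from datetime import datetime, timedelta

def restart_times(startup, restart_after_minutes, count):
    """Project the next `count` restart times, assuming each countdown
    starts at the moment the (re)started instance comes up."""
    interval = timedelta(minutes=restart_after_minutes)
    times = []
    current = startup
    for _ in range(count):
        current = current + interval
        times.append(current)
    return times

# Hypothetical instance started at 09:17 with "restart after" set to 60 minutes.
for t in restart_times(datetime(2024, 1, 1, 9, 17), 60, 3):
    print(t.strftime("%H:%M"))
```

With these inputs the projected restarts fall at 10:17, 11:17 and 12:17, i.e. at whatever minute the instance happened to start, not on the hour.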