It’s definitely intended to be a percentage. The value is the result of 100 * (currentWork / totalWork). If it ever exceeds 100 it means currentWork > totalWork, which would be a bug.
I assume it only occurs in certain scenarios; I haven’t been able to reproduce it yet.
What would be the likely cause of this? I have been getting this regularly while monitoring the progress of running jobs. The jobs usually finish eventually, but there was one time where a job got stuck and never finished. I had to issue CancelOperation on it for the others to come through.
The scheduler that distributes jobs across engines reports progress as 100 * (currentWork / totalWork). Each individual engine reports progress in the range 0-100, and currentWork is the sum of the progress values of all engines. The totalWork value is initialized to 100 * (number of engines).
Looking at this again, I think what complicates matters is that the scheduler may add additional engines while the job is in progress, which affects currentWork but not totalWork.
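Roughly, in toy-model terms (this is a simplified sketch of the behaviour described above, not the actual scheduler code; all names are illustrative), adding an engine mid-job grows currentWork’s potential while totalWork stays fixed, so the reported percentage can pass 100:

```python
# Toy model of the progress calculation described above.
# Not the real implementation; names and structure are illustrative only.

class JobProgressModel:
    def __init__(self, engine_count: int):
        # totalWork is fixed when the job starts: 100 per engine.
        self.total_work = 100 * engine_count
        self.engine_progress = [0] * engine_count  # each engine reports 0-100

    def add_engine(self):
        # The suspected bug: a new engine contributes to currentWork,
        # but totalWork is not increased to match.
        self.engine_progress.append(0)

    def report_engine_progress(self, engine_index: int, value: int):
        self.engine_progress[engine_index] = value

    @property
    def percentage(self) -> float:
        current_work = sum(self.engine_progress)  # sum of all engine progress
        return 100 * current_work / self.total_work


job = JobProgressModel(engine_count=2)   # totalWork = 200
job.report_engine_progress(0, 100)
job.report_engine_progress(1, 80)
print(job.percentage)                    # 90.0 -- still sane

job.add_engine()                         # scheduler reassigns an idle engine
job.report_engine_progress(1, 100)
job.report_engine_progress(2, 100)
print(job.percentage)                    # 150.0 -- exceeds 100%
```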
I’ll create a ticket for someone to investigate further.
I see. If totalWork were also adjusted dynamically when a new engine is added, that would solve the inaccuracies.
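In terms of the simplified model above, I imagine the adjustment would be something like this (purely illustrative; whether the real scheduler can do the same is obviously up to you):

```python
class AdjustedJobProgressModel(JobProgressModel):
    """Same toy model, but totalWork grows when an engine joins mid-job."""

    def add_engine(self):
        super().add_engine()
        self.total_work += 100   # keep the denominator in step with the new engine


job = AdjustedJobProgressModel(engine_count=2)   # totalWork = 200
job.report_engine_progress(0, 100)
job.report_engine_progress(1, 100)
job.add_engine()                                 # totalWork becomes 300
job.report_engine_progress(2, 50)
print(job.percentage)                            # 83.3..., never above 100
```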
To me personally it doesn't matter, as long as it finishes its task. But I don't want it having an effect that contributes to the hang-ups. As long as it's doing something and we can see it progressing, I'm good.
But if the hang-ups have something to do with this, then it becomes a big issue.
Thanks Sander, please do update this thread if you find out more about this.
Chiming in on this topic: since implementing health checks, analytics, and a dashboard (work in progress) to monitor the OLConnect Server, we quickly noticed the same issue.
We’ve observed cases where the progress percentage exceeds 100%, with some Content Creation processes reaching an astonishing 400%.
From what I can tell, this issue appears only during Content Creation operations. However, this might be a sampling bias, as we don’t have similarly heavy datamining, job, or output processes for comparison.
Here’s a snippet of today’s jobs, some of which exceeded 100%:
This percentage calculation issue is critical for us. We developed the dashboard and monitoring processes because we frequently encounter stalled jobs. Previously, when this happened, we had no way to determine whether the workflow or the server itself was stuck, so our usual approach was to forcefully terminate everything and reboot the VM.
With this monitoring, we hope to:
Gain better insights into what’s happening.
Identify and terminate long-running stuck operations with the cancelOperation endpoints (see the sketch after this list).
Improve overall workflow reliability.
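For reference, the watchdog we are building is roughly along these lines. This is only a sketch: the base URL, the endpoint paths, and the assumption that the progress response is a plain number are placeholders and need to be adapted to the actual OLConnect Server REST API and authentication.

```python
import time
import requests  # third-party HTTP client

BASE_URL = "http://olconnect-server:9340"   # placeholder host/port
STALL_LIMIT_SECONDS = 30 * 60               # cancel after 30 min without progress


def watch_operation(operation_id: str) -> None:
    """Poll an operation's progress and request cancellation if it stops advancing."""
    last_progress = -1.0
    last_change = time.monotonic()

    while True:
        # Placeholder path; substitute the real getProgress endpoint.
        resp = requests.get(f"{BASE_URL}/operations/{operation_id}/progress", timeout=10)
        resp.raise_for_status()
        progress = float(resp.text)

        if progress >= 100:   # treat >=100 as done, tolerating the >100% bug
            print(f"{operation_id}: finished ({progress:.0f}%)")
            return

        if progress != last_progress:
            last_progress = progress
            last_change = time.monotonic()
        elif time.monotonic() - last_change > STALL_LIMIT_SECONDS:
            # Placeholder path; substitute the real cancelOperation endpoint.
            requests.post(f"{BASE_URL}/operations/{operation_id}/cancel", timeout=10)
            print(f"{operation_id}: stalled at {progress:.0f}%, cancel requested")
            return

        time.sleep(15)   # polling interval
```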
@Sander please keep us in the loop for updates on the subject!
I believe progress reported by the getProgress endpoint can only exceed 100% if the scheduler assigns additional merge engines to a job on the fly. For example, if five jobs are in progress and one of the jobs finishes, the engines that worked on that job could be reassigned to help with the remaining four jobs. This messes up the progress state.
We have already made improvements to that code (I happened to give a short internal demo about this earlier today), but I do not expect those improvements to be finalized and released any time soon. The work involves changing the way individual merge engines report progress and extending the getProgress endpoint to provide additional information, such as running totals for records and pages.
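In the meantime, a pragmatic workaround on the dashboard side is to clamp whatever getProgress returns before displaying it. A minimal sketch; the helper name is made up, and treating values above 100 as "complete for display purposes" is just a display choice, not something the server guarantees:

```python
def display_progress(raw_percentage: float) -> float:
    """Clamp a raw getProgress value into the 0-100 range for display.

    Values above 100 are a known symptom of engines being added to a job
    on the fly; for display purposes we cap them rather than show 400%.
    """
    return max(0.0, min(100.0, raw_percentage))


assert display_progress(42.0) == 42.0
assert display_progress(400.0) == 100.0   # the Content Creation case from this thread
```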
I’m glad to hear that some effort is being put into this subject! At the same time, I’m sad to acknowledge that a fix will likely never be released, or only in a few years.
However, thanks Sander for getting back to the community and letting us know the state of the issue!