Workflow Process Design

More, or less?

Putting aside questions of maintainability, and focusing solely on performance, is it better to have multiple (very similar) processes, or better to consolidate them all into a single process?

At a large government print facility that handles print and mail operations for all of the various state agencies, we have an architecture where each agency has its own “Folder” in Workflow, with a dedicated process to produce print and PDF output, plus individual processes that perform any job-specific preprocessing on the data files before submitting them to the “dedicated print output” process.

Many of these processes are very similar, and the question came up: would it be better to consolidate them into one “master process”?

In terms of pure performance, separate processes will fare (very) marginally better.
I doubt the difference would be measurable, though.

Sorry, I was just about to jump into a meeting, so I answered a bit quickly. Reading back, I realize I should have been a little more thorough in my answer…

Whenever a workflow process is triggered by new files, it takes a snapshot of all files in the monitored folder. It then processes that list and once it’s done, it goes back one more time to see if new files have come in while the processing was going on. It builds a new snapshot of all the files, processes them and the entire procedure is repeated until no more files are found.
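
To make that loop concrete, here is a minimal sketch in Python of the snapshot-and-repeat behaviour described above. This is just a toy model, not Workflow’s actual implementation; the `folder` argument and `process_file` callback are stand-ins:

```python
import os

def run_capture_cycle(folder, process_file):
    """Toy model of the trigger behaviour: snapshot the monitored
    folder, process every file in the snapshot, then re-scan;
    repeat until a scan finds nothing new."""
    while True:
        snapshot = [os.path.join(folder, name) for name in os.listdir(folder)]
        if not snapshot:
            break  # nothing arrived during the last pass, so we're done
        for path in snapshot:
            process_file(path)
            os.remove(path)  # captured files leave the monitored folder
```

The key point is the re-scan at the end of each pass: files that arrive mid-run are picked up on the next iteration rather than waiting for a fresh trigger.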

Imagine that this process is not self-replicating: each file will then be processed sequentially. In that case, it would be much more efficient to have separate processes monitoring their own folders, because then all folders would be processed in parallel. I think that’s a pretty clear-cut argument in favor of having multiple processes.

But if the process is self-replicating (for argument’s sake, let’s say it is set to have up to 10 self-replicated instances and the folder contains 100 files), then after the snapshot has been taken, 10 instances of the process are created and each of them is handed 10 files, which means the files are processed in parallel. Now what’s the difference between those 10 self-replicated instances and 10 separate processes? Well… not much. As I stated in my initial answer, the self-replicated process might be very slightly slower than a standard process because it first has to be cloned before it can start processing its share of files. The cloning procedure takes milliseconds… at most.
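
As a rough model of that scenario (the per-file cost and file names below are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def process_file(path):
    time.sleep(0.01)  # stand-in for the real per-file work

files = [f"job_{i:03d}.dat" for i in range(100)]  # the 100-file snapshot

# A self-replicating process capped at 10 instances behaves roughly like
# a pool of 10 workers sharing the snapshot: wall-clock time is about
# (100 files / 10 workers) x per-file time, plus a one-time cloning
# cost of milliseconds per instance.
with ThreadPoolExecutor(max_workers=10) as pool:
    pool.map(process_file, files)
```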

That’s why I stated that having multiple separate processes may fare marginally better, but not by much.

I personally prefer to use fewer processes and rely heavily on the self-replication feature. It makes managing and maintaining processes much easier. However, if using a single self-replicating process means you have to add several conditions inside that process to account for the variations in processing different types of data files, then you might start seeing a more noticeable difference in performance.

That’s because every condition takes milliseconds to resolve (and sometimes it can be several milliseconds, for instance if the condition needs to extract some content out of a PDF). For a single file, it won’t make much of a difference. But after a few thousand files, those extra operations add up.
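
To put numbers on it, here is a back-of-the-envelope estimate; both figures below are assumptions, not measurements:

```python
extra_conditions = 4        # assumed: branches added to a consolidated process
cost_per_condition = 0.005  # assumed: 5 ms each, e.g. a PDF content check
file_count = 10_000

overhead = extra_conditions * cost_per_condition * file_count
print(f"Added overhead: {overhead:.0f} seconds")  # -> 200 seconds
```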

In that case, you’re better off with separate processes.

Not the clear-cut answer you were expecting, I’m sure, but at least now you have more info on which to base your implementation.


Very thorough, and in line with what I was thinking. We already have a self-replicating “genericConnect” process in production. I created a web front-end for creating trigger files for this process. The trigger files specify all Connect resources and the folder locations for input and output. The process handles the most common tasks, including outputting a print stream, an archive copy, and separated PDF output to return to the client agencies for their own archival and customer-support needs (as well as data validation and job status reporting).
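
The thread doesn’t show the actual trigger layout, but as a purely hypothetical sketch, the front-end might emit something along these lines:

```python
import json

# Every field name, resource name, and path below is invented purely
# for illustration; the real trigger layout is whatever the web
# front-end actually emits.
trigger = {
    "datamapper":  "AGENCY_STATEMENTS.OL-datamapper",
    "template":    "AGENCY_STATEMENTS.OL-template",
    "inputFolder": r"D:\Jobs\AgencyX\In",
    "printOutput": r"D:\Jobs\AgencyX\Print",
    "pdfOutput":   r"D:\Jobs\AgencyX\PDF",
}

with open("AgencyX.trigger", "w") as f:
    json.dump(trigger, f, indent=2)
```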

I think an architecture where new automated jobs get a dedicated process that captures the data file(s), applies any pre-processing required, and then creates a trigger file to run the “genericConnect” process is a very good one. It allows job customization while keeping standards in place. Code reuse, less to maintain… all good things.

Thank you for the input!