While Instalink provides great flexibility in designing processes and data flows, it is important to remember that the amount of data that can be loaded into a processor is not limitless. Instalink generally prioritizes processing speed over memory management, so it is necessary to bear this in mind when creating data flows that may potentially ingest large amounts of data.
Ensuring a memory-efficient Instalink data flow is a matter of accounting for time and space: time is how long it takes to process the data, and space is the size of the data being processed. A process can become inefficient if either metric grows too large.
Where possible, data flows should be lightweight and only process the data needed at the time the data is needed. Following some basic design rules can help ensure that your processes always run smoothly and without interruption.
Small Inputs, Small Outputs: Keep Data Size Consistent
The amount of data that is being processed at any given moment on the processor should be as small as possible, and it should be processed as fast as possible. The processor holds on to the data for as long as it is being processed, which means that the memory stays in use and cannot be deallocated to make room for other tasks. The larger the data is and the longer it takes to process, the more memory the processor consumes. The processor will reboot and cancel currently running data flows once all 1.7 GB of available RAM is allocated and nothing is available to deallocate. This situation must be avoided.
Keep Input Data Small
Larger data sets necessitate more computational time, which keeps the memory allocated for longer. It is better to load a single record and process it extremely quickly than to load 5000 records at once and process them all slowly. Transformations that work well with small data sets may not scale well to large data. For example, most transforms have operational complexity that scales in direct proportion to the amount of data being processed. But some transforms have operational complexity that grows faster than the amount of data being processed.
In the following transform, the processing time will scale linearly with the number of records in VALUE. It should be just as fast to run this transform on a large data set as it would be to run it multiple times on smaller data sets.
# CPU Time scales linearly with the size of the input array
# CPU Time = number of records in VALUE * 1
ARRAY_LENGTH( VALUE )

Even though that example transform scales linearly, it is important to remember that the memory for a large data set stays allocated while the transform runs. Breaking the data into smaller chunks would still be more memory efficient, even though it would process more slowly because of the overhead of instantiating new transform actions.
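The chunking idea can be illustrated outside of Instalink with a minimal Python sketch. The `chunked` helper and the batch size of 500 are assumptions made for illustration and are not part of Instalink. In a real data flow each batch would be handled by its own run, so only one batch would be held in memory at a time.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(records: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of `records` holding at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(records), size):
        yield list(records[start:start + size])


# Hypothetical input: 5000 records split into batches of 500.
records = list(range(5000))
batch_sizes = [len(batch) for batch in chunked(records, 500)]
print(len(batch_sizes), "batches, largest holds", max(batch_sizes), "records")
```

Smaller batches lower the peak memory per run at the cost of more runs, which is the trade-off described above.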
However, in the following transform, the processing time grows faster than the size of the input. Running this script on one large input could be noticeably slower than running it on several smaller inputs.
# CPU Time is loglinear which means that it will run a logarithmic function
# of the number of records for every record in the input array.
# CPU Time = log(number of records in VALUE) * number of records in VALUE
ARRAY_SORT( VALUE )

In the above example, the CPU time grows faster than the input data does. At small scales, the difference is negligible, but processing very large data sets with this transform can cause the process to run slowly, which in turn prevents the processor from releasing the memory allocation in a timely manner. Doing many such transforms on large data will compound the problem. Keeping record counts low mitigates much of this issue.
Keep data inputs as small as possible and be aware of the time complexities of data transformations when processing larger data sets.
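To make the difference between the two growth rates concrete, the following Python sketch compares the linear and loglinear cost formulas from the transform comments. The function names are hypothetical, the base-2 logarithm is an assumption (the formula above does not state a base), and the numbers are relative operation counts rather than measured Instalink timings.

```python
import math


def linear_cost(n: int) -> float:
    """Relative cost of a transform that touches each record once."""
    return float(n)


def loglinear_cost(n: int) -> float:
    """Relative cost of a transform such as a sort: n * log2(n)."""
    return n * math.log2(n) if n > 1 else float(n)


for n in (10, 1_000, 100_000):
    ratio = loglinear_cost(n) / linear_cost(n)
    print(f"{n:>7} records: loglinear cost is about {ratio:.1f}x the linear cost")
```

The ratio keeps growing as the record count grows, which is why a sort that is harmless on a few records becomes a noticeable cost on very large inputs.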
Keep Output Data Small
When designing data flows, it is very important to keep the size of the process data consistent throughout the entire flow.
For example, sending 100 records to a data flow should result in approximately 100 records being processed and returned in the data flow. This ensures that the memory footprint of the process does not grow during processing. Any situation where the data grows during processing can cause unexpected spikes in memory usage.
Here is an example of a data flow that keeps the data size consistent:
- Endpoint receives 100 records.
- Data flow then sends all 100 records in a batch to an external endpoint.
- Then it responds to the initial caller with the 100 records.
In that example, the data size never increases dramatically because the data flow is designed to ensure that the process data scales linearly. The process data output is more-or-less the same size as the input. This makes it easier to anticipate the behavior of the flow and ensure that it can be implemented in a way that scales linearly with the number of records processed.
The following is an example of a data flow that grows greatly during processing. This type of data flow should be avoided where possible:
- Endpoint receives 100 records.
- Then for each record a call is sent to an external endpoint which returns 100 new records.
- The data flow then responds to the initial caller with 10000 records.
This simple example demonstrates how quickly the memory space of a flow can be inflated by including actions which load more data onto the stack. A single run of this data flow increases the memory by 100 times while it is processing. This makes it more difficult to scale the call to larger data inputs. Running this data flow 5 times at once with 100 records each time would put 50000 records worth of data on the processor's memory stack. At large scale, this design pattern could eventually require more allocation than the processor has available.
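The arithmetic behind these figures can be written out as a short Python sketch. The function and parameter names are hypothetical; the numbers are the ones used in the example above.

```python
def peak_records(input_records: int, records_per_call: int, concurrent_runs: int = 1) -> int:
    """Records held in memory when every input record fans out into more records."""
    return input_records * records_per_call * concurrent_runs


single_run = peak_records(100, 100)      # one run: 100 inputs, 100 records returned per call
five_runs = peak_records(100, 100, 5)    # five runs of the same flow at once
growth_factor = single_run // 100        # output size relative to the 100 input records

print(single_run, five_runs, growth_factor)  # prints: 10000 50000 100
```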
Implement design patterns that maintain a consistent data size throughout the course of processing.
Don't Allow Unbounded Data
A data flow should always be designed to anticipate a specific size of data input. Don't allow a caller to send in unlimited numbers of records or data. Place maximum size restrictions on endpoints and design the data flow to run efficiently at that size.
For example, an endpoint set up to receive a maximum of 10 records, each of a certain size, will never exceed the anticipated memory allocation on any individual run of the data flow. If the data flow also doesn't load in any unbounded data during processing, it should be expected that every run of the flow will require similar memory allocations. This makes scaling more straightforward. It also makes it easier to handle spikes in processing demands.
Conversely, it is difficult to maintain efficient and consistent processing times and allocations on an endpoint that allows an unlimited number of records. The caller could send in an unexpectedly high number of records at any time and overwhelm the data flow if it was not designed to handle that number of records.
Data flows that allow unbounded data are nearly impossible to scale accurately with confidence. Any spiking process could cause the CPU time to increase and the memory allocation to increase to the point where processing speeds slow considerably. Multiple unbounded data flows running simultaneously on a processor will also take resources from each other which will in turn cause the data flows to run even slower. This inefficiency is compounded the more a processor handles data with inconsistent sizes.
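As a language-neutral illustration of bounding input, here is a minimal Python sketch of a guard that rejects oversized batches before any processing starts. `MAX_RECORDS`, `TooManyRecordsError`, and `validate_batch` are hypothetical names chosen for this sketch. In Instalink the limit would be configured on the endpoint itself rather than written as code like this.

```python
from typing import Any, List

MAX_RECORDS = 10  # hypothetical limit, matching the 10-record example above


class TooManyRecordsError(ValueError):
    """Raised when a caller sends more records than the flow was designed for."""


def validate_batch(records: List[Any], max_records: int = MAX_RECORDS) -> List[Any]:
    """Return the batch unchanged if it is within the limit, otherwise reject it."""
    if len(records) > max_records:
        raise TooManyRecordsError(
            f"received {len(records)} records, maximum is {max_records}"
        )
    return records


accepted = validate_batch([{"id": i} for i in range(10)])
print(len(accepted), "records accepted")
```

Rejecting the request up front means the worst-case memory allocation of a run is known in advance.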
Use Process Queues
Long-running data flows with multiple calls to external endpoints should implement process queues. Consider the following data flow:
- Endpoint receives 100 records.
- Then for each record a call is sent to an external endpoint.
- Then data flow responds to the caller.
Each call that is sent to the external endpoint causes the data flow process to wait for a response. The allocated memory stays in use during the entire time spent waiting. If the calls run one after another and each takes 2 seconds to return a response, the data flow will stay active for a minimum of 200 seconds (about 3.3 minutes). This is problematic because the memory doesn't get released until the final outbound call completes. The processor will hold on to the data while doing nothing except waiting for a response. Running multiple instances of this data flow at the same time will cause the processor to lock up resources that could be used for other tasks in the project.
Process queues should be used to allow the processor to release the memory before making the calls to the external endpoint. Using the process queue also allows the processor to seamlessly handle other tasks while waiting to send the next call.
Here is how the data flow should be reorganized using a process queue:
- Endpoint receives 100 records.
- Then each record is sent to a process queue.
- A response is sent to the caller once all items are queued.
- A call is sent to an external endpoint when each queue item runs.
The process queue will only load the data that is stored in the queue, so store only the data that the queue process needs. In the above example, each queue item should only include a single record. This keeps the data size small and consistent, and ensures that memory does not build up on the processor. Scaling also becomes easier because process queues are load balanced across available processors, so adding more processors directly improves throughput when utilizing process queues.
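The reorganized flow can be sketched in Python with the standard library `queue` module standing in for an Instalink process queue. This is only an analogy under stated assumptions: `receive`, `run_queue`, and the `send` callback are hypothetical names, and the external endpoint is replaced by a list so that the sketch runs on its own.

```python
import queue
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def receive(records: List[Record], work_queue: queue.Queue) -> Dict[str, int]:
    """Queue each record as its own item, then respond without making external calls."""
    for record in records:
        work_queue.put(record)
    return {"queued": len(records)}


def run_queue(work_queue: queue.Queue, send: Callable[[Record], None]) -> int:
    """Run queued items one at a time; each item carries a single record."""
    processed = 0
    while True:
        try:
            record = work_queue.get_nowait()
        except queue.Empty:
            return processed
        send(record)  # stand-in for the call to the external endpoint
        work_queue.task_done()
        processed += 1


sent: List[Record] = []          # stand-in for the external endpoint
work_queue: queue.Queue = queue.Queue()
response = receive([{"id": i} for i in range(100)], work_queue)
processed = run_queue(work_queue, sent.append)
print(response, processed)
```

The caller gets its response as soon as the items are queued, and each queued item only needs the memory for one record while it runs.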
Use Custom Scripts for Large Transform Actions
Each transform in a transform action requires a small minimum memory allocation and CPU time. Usually this is negligible and, in the vast majority of cases, doesn't negatively impact the overall memory allocation of the running process. However, there are a few situations where this minimum allocation could noticeably increase the memory footprint of the processor.
Consider the situation where a data flow is transforming a list of large documents. Each document could be a record that has dozens or even hundreds of properties. The conventional methodology for transforming all of these properties would be to create a single transform action that has dozens of individually defined transforms. Because each transform allocates a minimum amount of memory, the transform action as a whole can end up with a large overall allocation. This allocation is compounded if the transform is iterating through a large set of these records.
Consider the hypothetical situation where a transform action has 100 individually defined transforms and each transform is iterating over 10,000 records. Each run of the transform action could instantiate 1 million different transform operations. At this scale, the many tiny allocations quickly add up to create a large overall minimum allocation for the transform action.
A more efficient way to manage this hypothetical transform would be to define a single transform operation that contains a custom script which includes all of the needed transforms. This would ensure that the minimum allocation directly corresponds with the number of iterations. The hypothetical transformation described above would only have 10,000 allocations instead of 1 million.
For example, the minimum allocation would be large in a situation where the document being transformed has 100 properties, there is a separate transform in the action for each of these, and the transforms are iterating over 10,000 documents.
Avoid creating many individual transform operations on a transform action.
# Each transform operation is individually defined and each incurs a separate minimum allocation per execution.

# Transform operation 1
Input Key = records.id
Transform = TEXT
Output Key = output._id

# Transform operation 2
Input Key = records.first
Transform = TEXT_CAPITALIZE
Output Key = output.firstName

# Transform operation 3
Input Key = records.last
Transform = TEXT_CAPITALIZE
Output Key = output.lastName

... etc ...
Instead, create a single transform operation to more efficiently iterate over the data.
# Only one transform operation means that there is only one minimum allocation per execution.
Input Key = records
Transform = SCRIPT
Output Key = output
Script = [
  "_id" => TEXT(VALUE.id),
  "firstName" => TEXT_CAPITALIZE(VALUE.first),
  "lastName" => TEXT_CAPITALIZE(VALUE.last),
  ... etc ...
]
Note that consolidating large numbers of transform operations also considerably improves the execution time.
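The difference between the two layouts can be modeled with a small Python analogy that counts how many operations each approach instantiates. This is not Instalink code: `per_field`, `per_record`, and `FIELD_TRANSFORMS` are hypothetical names, and Python's `str` and `str.capitalize` are stand-ins assumed to behave roughly like `TEXT` and `TEXT_CAPITALIZE`.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]

# (input key, transform, output key), mirroring the three operations listed above.
FIELD_TRANSFORMS: List[Tuple[str, Callable[[str], str], str]] = [
    ("id", str, "_id"),
    ("first", str.capitalize, "firstName"),
    ("last", str.capitalize, "lastName"),
]


def per_field(records: List[Record]) -> Tuple[List[Record], int]:
    """Separate operations: every field transform visits every record."""
    outputs: List[Record] = [{} for _ in records]
    operations = 0
    for in_key, transform, out_key in FIELD_TRANSFORMS:
        for record, out in zip(records, outputs):
            out[out_key] = transform(record[in_key])
            operations += 1
    return outputs, operations


def per_record(records: List[Record]) -> Tuple[List[Record], int]:
    """Single scripted operation: each record is visited once."""
    outputs: List[Record] = []
    operations = 0
    for record in records:
        outputs.append({out: fn(record[key]) for key, fn, out in FIELD_TRANSFORMS})
        operations += 1
    return outputs, operations


records = [
    {"id": "1", "first": "ada", "last": "lovelace"},
    {"id": "2", "first": "alan", "last": "turing"},
]
field_out, field_ops = per_field(records)
record_out, record_ops = per_record(records)
print(field_ops, record_ops, field_out == record_out)  # prints: 6 2 True
```

Both approaches produce the same output, but the per-field count grows with fields times records while the per-record count grows with records only, which matches the 1 million versus 10,000 comparison above.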