Sunday, September 9, 2012

Agent framework scalability and tuning considerations for high volumes

The core of the Sterling solution for many implementations lies in the Sterling agent framework and the APIs provided for monitoring and order fulfillment. OMS implementations typically use the Schedule Order, Release Order, ConsolidateToShipment and Real Time Availability Monitor agents, to name a few. Although every agent works differently and there is no run_faster parameter available to scale up Sterling agents, there are a few underlying elements that largely control the extent of their scalability. Very little is documented in the public domain on how exactly the agent framework works and to what extent it can scale, so here goes my attempt to demystify agent operations and scalability. This post assumes that you are familiar with the OMS nomenclature; if not, you may want to read my earlier post first.

How a Generic Agent works -
A generic Sterling agent is a background batch processing job that does the following (a simplified sketch in Java follows the list) -
  1. Checks if there are messages to process on the configured JMS queue. If the queue is empty, posts a getJobs message and goes to Step 2; otherwise goes to Step 4.
  2. Reads the getJobs message and gets the first set of jobs (first batch) from the database using the getJobs method, up to the defined buffer size (the "Number of records to buffer" configuration, which defaults to 5000).
  3. Writes these records back into the configured JMS queue as executeJobs messages, along with the next getJobs message containing the last fetched record key, such as a TaskQKey.
  4. Retrieves executeJobs messages from the queue and does the necessary processing using the executeJobs method.
  5. After finishing the first batch, gets the next set of jobs (second batch), up to the buffer size, using the last fetched record key in the getJobs message.
  6. Works on the second set of jobs.
  7. Continues the above process until all the pending jobs are worked upon.
  8. Once all the pending jobs are worked upon, waits for a signal, i.e. the agent trigger, to start working again.
  9. Upon getting the signal to start, the agent starts working again, i.e. follows Step 2 to Step 7.
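To make the flow above concrete, here is a minimal sketch of the loop in Java. This is not the actual Sterling implementation - the class, the in-memory queue and fetchJobsFromDatabase are placeholders standing in for the agent framework and its JMS queue.

import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

public class GenericAgentSketch {

    static final int BUFFER_SIZE = 5000;            // "Number of records to buffer"
    final Queue<String> queue = new ArrayDeque<>(); // stands in for the agent's JMS queue

    // Steps 1 and 4-9: drain the queue, dispatching getJobs and executeJobs messages
    void run() {
        if (queue.isEmpty()) {
            queue.add("getJobs:");                  // Step 1: self trigger
        }
        while (!queue.isEmpty()) {
            String message = queue.poll();
            if (message.startsWith("getJobs:")) {
                getJobs(message.substring("getJobs:".length()));
            } else {
                executeJob(message);                // Step 4: process one job
            }
        }
        // Steps 8-9: queue drained; wait for the next agent trigger (not shown)
    }

    // Steps 2-3 and 5: fetch the next batch and post it back as executeJobs
    // messages, followed by a getJobs message carrying the last fetched key.
    void getJobs(String lastKey) {
        List<String> batch = fetchJobsFromDatabase(lastKey, BUFFER_SIZE);
        for (String jobKey : batch) {
            queue.add("executeJobs:" + jobKey);
        }
        if (!batch.isEmpty()) {
            queue.add("getJobs:" + batch.get(batch.size() - 1));
        }
    }

    void executeJob(String message) { /* business logic for one job, e.g. schedule an order */ }

    // Placeholder for the getJobs query; the real query depends on the agent
    List<String> fetchJobsFromDatabase(String lastKey, int max) { return List.of(); }
}

The real agent consumes from a JMS queue with multiple threads and JVMs, which is where the details below come in.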
More details on default agent behavior - 
Triggering an agent is the act of posting a getJobs message to the JMS queue. Triggering may be manual or automatic, i.e. self triggered. During agent startup, if there are no messages in the queue, the agent automatically triggers itself.
Within the getJobs method, the agent tries to acquire a lock on the YFS_OBJECT_LOCK table for the agent criteria ID.
If the lock is not available, the getJobs method exits and does nothing. This ensures that duplicate sets of records are not retrieved for processing.
If the lock is available, the getJobs method fetches the records that need to be processed.
These records are posted as executeJobs messages to the JMS queue. For each message, depending on the JMS session pooling setting, a new MQ session is either created or borrowed from the pool to post the message, and the session is then closed or returned to the pool. This default behavior could change in an upcoming version as a result of the testing we undertook for one of our customers.
After the executeJobs messages, one getJobs message is also posted with the last record key to facilitate retrieval of the next batch.
Each agent thread picks up executeJobs messages one by one and processes them. Multiple executeJobs threads can run concurrently.
Once all the executeJobs messages are consumed, only the getJobs message is left in the queue; the agent thread that picks it up runs the getJobs() method and the processing cycle continues.
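In rough pseudocode form, the getJobs side of this looks something like the sketch below. The method names (tryAcquireLock, fetchBatch, postExecuteMessage and so on) are placeholders for illustration, not actual Sterling framework APIs.

import java.util.List;

public class GetJobsLockSketch {

    // Rough illustration of why getJobs is effectively single threaded: the
    // work happens only if the YFS_OBJECT_LOCK row for this criteria ID can be locked.
    void onGetJobsMessage(String criteriaId, String lastKey, int bufferSize) {
        if (!tryAcquireLock("YFS_OBJECT_LOCK", criteriaId)) {
            return; // another thread/JVM is already fetching; avoids duplicate batches
        }
        try {
            List<String> batch = fetchBatch(lastKey, bufferSize);  // the getJobs query
            for (String jobKey : batch) {
                postExecuteMessage(jobKey);   // one executeJobs message per record;
                                              // a session is created or borrowed per message
            }
            if (!batch.isEmpty()) {
                postGetJobsMessage(batch.get(batch.size() - 1));   // key for the next batch
            }
        } finally {
            releaseLock("YFS_OBJECT_LOCK", criteriaId);
        }
    }

    // Stubs so the sketch compiles; the real implementations live inside the framework.
    boolean tryAcquireLock(String table, String key) { return true; }
    void releaseLock(String table, String key) { }
    List<String> fetchBatch(String lastKey, int max) { return List.of(); }
    void postExecuteMessage(String jobKey) { }
    void postGetJobsMessage(String lastKey) { }
}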

Scalability concerns and scaling the Availability Monitor agent -
Are Sterling agents multi-threaded?
Not entirely. The getJobs component is deliberately made single-threaded via the database lock on YFS_OBJECT_LOCK to ensure the same set of records is not retrieved and processed multiple times. However, the bulk of the workload is in the executeJobs component, which is multi-threaded and can run in multiple JVMs.

Will my agent scale to meet the peak throughput?
Depends on your volumes. Scaling an agent involves tuning both the getJobs and the executeJobs components. The scaling and tuning of the executeJobs component is a different exercise that varies with the use case, so it will not be covered in this post. At low to medium volumes (under 100K jobs/hour) scalability issues lie largely with the executeJobs component, and the default settings that govern agent behavior should work well. If you are using the agent framework to process over 150K "jobs" per hour, there may be challenges with the default implementation. I use the term jobs to denote the message entity; for example, jobs in the case of ScheduleOrder are distinct orders and for the Availability Monitor they are distinct inventory items.

What are the elements that affect scaling beyond 100K jobs/hour?
  1. Performance of the getJobs query - The slower the query, the more time is spent retrieving the records to process.
  2. Time taken to write all of the retrieved executeJobs messages to the queue - The default behavior of creating and closing an MQ session to write each individual message is a significant overhead. Using the product HF to enable bulk loading of messages significantly improves the write time per message. Other aspects, such as the persistence setting used for the queue and the network latency between the agent servers and the MQ server, can also affect message write times to the queue.
  3. Buffer size of records to get - The default of 5K may not suffice at very high loads, as it would mean 40 or more executions of the getJobs component to achieve just 200K/hour throughput. Since getJobs is single-threaded, the number of times it runs needs to be kept optimal.
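To put some illustrative numbers on this: with the default buffer of 5,000 records, a 200K/hour workload means the single-threaded getJobs component has to run 40 times an hour, and each run has to write 5,000 executeJobs messages to the queue. At the 11-20 ms per persistent write we observed in the case study below, writing one buffer's worth of messages alone takes roughly 55-100 seconds, so the 40 runs add up to 37-67 minutes of every hour spent inside getJobs just posting messages, before any executeJobs processing happens. That is why queue write time and buffer size become the dominant factors at high volumes.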

Scaling the Real Time Availability Monitor (RTAM) Agent - A case study
At a customer site, one of the challenges was to scale the Real Time Availability Monitor agent to do the Partial Sync of inventory at over 250K records/hour. The customer was running Sterling 8.5 HF25, WAS 7 and MQ 7. The following actions were taken to scale the agent from about 150K/hr to around 300K/hr -
1. Tuning the getJobs query - Front loading the YFS_INVENTORY_ACTIVITY table would heavily skew the test results due to the excessive time spent querying it as part of the getJobs query. Hence, trimming the Inventory Activity table or keeping it in check significantly alters the time taken by getJobs and also more realistically represents the production workload. We also ensured the correct index was used and statistics were updated.
2. Setting the agent queue to non-persistent - We defined the internal JMS queue as non-persistent on MQ and then set the PER(QDEF) option in the .scp file for this queue's entry while generating the bindings. Writing each message to the persistent queue took between 11-20 ms, whereas on the non-persistent queue it was under 5 ms.
3. Enabling JMS session pooling for this agent via the following property in customer_overrides.properties -
yfs.yfs.jms.session.disable.pooling=N
This allows sessions to be borrowed from and returned to the pool instead of new ones getting created and closed for each message (a generic JMS sketch of the difference appears after this list).
4. Enabling the bulk sender property for the agent framework to avoid creating and closing sessions for each message being posted to the queue. We worked with IBM Sterling support to accomplish this via 8.5 HF48. The below two properties were set in the customer_overrides.properties file -
yfs.agent.bulk.sender.enabled=Y
yfs.agent.bulk.sender.batch.size=50000
The batch size setting of 50000 should be equal to or greater than the maximum buffer size you plan to use across all agents.
5. Running the agents in the same data center as the MQ server - This helps reduce the latency between the two tiers and therefore improves overall performance. This may not always be possible if you are running agents in multiple data centers.
6. Increasing the buffer size of records retrieved from the default of 5000 to 10000 - We tested various settings between 5K and 25K and found that overall performance was best at 10000 for our setup. The optimal buffer size may vary with your environment and workload, so run performance tests to determine what works best for your needs.
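To illustrate what session pooling and the bulk sender (items 3 and 4 above) are avoiding, here is a generic JMS sketch of the two posting patterns. It uses the standard javax.jms API with the connection and queue assumed to be obtained elsewhere; this is not the Sterling agent framework's actual code, just the general pattern.

import java.util.List;

import javax.jms.Connection;
import javax.jms.DeliveryMode;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;

public class QueueWriteSketch {

    // Default-style pattern: a session and producer per message. With session
    // setup/teardown plus 11-20 ms per persistent write, this dominates the
    // getJobs time when a buffer holds thousands of records.
    static void postOnePerSession(Connection connection, Queue queue,
                                  List<String> jobKeys) throws JMSException {
        for (String jobKey : jobKeys) {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            try {
                MessageProducer producer = session.createProducer(queue);
                producer.send(session.createTextMessage(jobKey));
            } finally {
                session.close();
            }
        }
    }

    // Batched pattern: one session and producer reused for the whole buffer,
    // similar in spirit to pooled sessions and the bulk sender. Non-persistent
    // delivery mirrors the non-persistent queue change in step 2.
    static void postAsBatch(Connection connection, Queue queue,
                            List<String> jobKeys) throws JMSException {
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        try {
            MessageProducer producer = session.createProducer(queue);
            producer.setDeliveryMode(DeliveryMode.NON_PERSISTENT);
            for (String jobKey : jobKeys) {
                producer.send(session.createTextMessage(jobKey));
            }
        } finally {
            session.close();
        }
    }
}

The first method pays the session setup and teardown cost on every message; the second pays it once per buffer, which is essentially what the pooling and bulk sender changes buy you.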

Now that you have a better understanding of how the Sterling agent framework works, you should be in a better position to troubleshoot and scale the agents. Happy testing and tuning!