Sunday, October 7, 2012

When seeing is not believing - Agent flow misconfiguration unraveled



The old saying goes "seeing is believing", but the other day, while examining an issue in a customer environment, I saw something that made me do a double take: I could not believe what the Sterling application configuration was showing me. Hence the title of this post (not to mention my weakness for catchy titles). Read on to find out how the issue was investigated and to learn more about agent/flow configuration internals.

Like most issues it started out mundane - an Invalid Server error from one of the agent logs. The relevant lines from the logs of the agent server AsyncReqAgentServer are pasted below -


<Errors>
    <Error ErrorCode="YCP0223" ErrorDescription="Invalid Server." ErrorRelatedMoreInfo="No Services Configured for this Server: AsyncReqAgentServer">
        <Attribute Name="ErrorCode" Value="YCP0223"/>
        <Attribute Name="ErrorDescription" Value="Invalid Server."/>
        <Attribute Name="ErrorRelatedMoreInfo" Value="No Services Configured for this Server: AsyncReqAgentServer"/>
        <Stack>com.yantra.interop.services.InvalidConfigurationException


The AsyncReqAgentServer is typically used to run the ASYNC_REQ_PROCESSOR transaction. So I did what most of us would do: check the configuration of the ASYNC_REQ_PROCESSOR transaction. Here is what I saw - 


Now you can see what stumped me. On one hand the application configuration was showing one thing, while the same application's logs were vehemently indicating another. Putting on my PE hat, I figured that there was more to it than meets the eye and decided to dig a little deeper. 

First, I checked if the transaction is indeed running. A quick grep of the agent logs showed that it was running as part of the DefaultAgentServer as it was the DefaultAgentServer logs that had the "Starting service..." message.
Then, I decided to check the other environments to see where it was supposed to be running or configured. In Production I learnt that it was running under the AsyncReqAgentServer. The lower environments were mixed, but in most of them it was running on the DefaultAgentServer.
At this stage a combination of instinct and experience led me to venture a guess that it is probably right in Production and just messed up here and elsewhere and I just have to prove that. 
So I checked the server configuration instead of the transaction configuration. This is a neat little configuration screen that is not very well known, mostly because it is seldom used. Buried in the Platform Application view > System Administration grouping is the Configured Servers view. It can be used both to view all the servers defined and to see the details of the sub services or agent criteria configured for each server. Here is a screenshot - 



The sub service list tab shown is accessed by double-clicking an individual server to view its details. Here is what it showed for the AsyncReqAgentServer -

and for the DefaultAgentServer - 

So now it was clear that the logs were correct (at least in this scenario): ASYNC_REQ_PROCESSOR was indeed running as part of the DefaultAgentServer, and the AsyncReqAgentServer had no services configured. It was the configuration that was out of whack, with the server configuration and the transaction configuration disagreeing. That mystery unravels further if one digs into how these views are displayed and how configuration data is propagated. 

The transaction configuration view is based on the YFS_FLOW and YFS_SUB_FLOW tables, whereas the server configuration view and its associated sub-services are built on the YFS_SERVER and YFS_AGENT_CRITERIA tables. Normally these config tables stay in sync as long as all configuration changes are made manually. However, in most implementations the Master Config (MC) environment is maintained as the source of config changes, and CDT (the Configuration Deployment Tool) is used to promote configuration changes to the various environments. A problem in the MC environment, normally a crash or an incorrect data fix, could result in a misconfiguration, which is then promoted to the other environments via CDT. Production was spared because it was running an older version of the release and config changes were yet to be promoted there. 

Here is a query that I could have used to confirm my observations   - 

select agent_criteria_id, transaction_key, flow_key, server_key from yfs_agent_criteria 
where server_key in (select server_key from yfs_server where server_name = 'AsyncReqAgentServer')

It can be adapted to your situation, e.g. to determine which services are configured under a particular server. So when it comes to Sterling OMS (and perhaps most things in life), if you don't believe what you see, just look further. 
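For anyone who wants to try the query shape without touching a real OMS database, here is a minimal sketch in Python using an in-memory SQLite database. The two toy tables contain only the columns named in the query above (the real YFS_SERVER and YFS_AGENT_CRITERIA tables have many more), and the key values and the helper names build_toy_db and services_for_server are made up for illustration.

```python
import sqlite3


def build_toy_db():
    """Create toy versions of yfs_server and yfs_agent_criteria (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table yfs_server (server_key text, server_name text)")
    conn.execute(
        "create table yfs_agent_criteria "
        "(agent_criteria_id text, transaction_key text, flow_key text, server_key text)"
    )
    conn.executemany(
        "insert into yfs_server values (?, ?)",
        [("S1", "AsyncReqAgentServer"), ("S2", "DefaultAgentServer")],
    )
    # Mirror the misconfiguration: the criteria row points at DefaultAgentServer.
    conn.execute(
        "insert into yfs_agent_criteria values (?, ?, ?, ?)",
        ("ASYNC_REQ_PROCESSOR", "T1", "F1", "S2"),
    )
    return conn


def services_for_server(conn, server_name):
    """Return the agent criteria rows configured under the named server."""
    sql = (
        "select agent_criteria_id, transaction_key, flow_key, server_key "
        "from yfs_agent_criteria "
        "where server_key in "
        "(select server_key from yfs_server where server_name = ?)"
    )
    return conn.execute(sql, (server_name,)).fetchall()


if __name__ == "__main__":
    conn = build_toy_db()
    for name in ("AsyncReqAgentServer", "DefaultAgentServer"):
        print(name, services_for_server(conn, name))
```

With this toy data the first server comes back with no rows and the second with one, which is the same picture the Configured Servers view gave me.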



Sunday, September 9, 2012

Agent framework scalability and Tuning considerations for high volumes

The core of the Sterling solution for many implementations lies in the Sterling agent framework and the APIs provided for monitoring and order fulfillment. OMS implementations typically use the Schedule Order, Release Order, ConsolidateToShipment and Real Time Availability Monitor agents, to name a few. Although every agent works differently and there is no run_faster parameter available to scale up Sterling agents, there are a few underlying elements that largely control the extent of their scalability. Very little is documented in the public domain on how exactly the agent framework works and to what extent it can scale, so here goes my attempt to demystify agent operations and scalability. This post assumes that you are familiar with the OMS nomenclature; if not, you may want to read my earlier post first.

How a Generic Agent works -
A generic Sterling agent is a background batch processing job that does the following -
  1. Checks whether there are messages to process in the configured JMS queue. If the queue is empty, posts a getJobs message and goes to Step 2; else goes to Step 4.
  2. Reads the getJobs message and gets the first set of jobs (first batch) from the database using a getJobs method, up to the defined buffer size (the "Number of records to buffer" configuration, which defaults to 5000).
  3. Writes these records back into the configured JMS queue in the form of executeJobs messages, followed by the next getJobs message containing the last fetched record key, such as a TaskQKey.
  4. Retrieves executeJobs messages from the queue and does the necessary processing using the executeJobs method.
  5. After finishing the first batch, gets the next set of jobs (second batch) up to the buffer size, using the last fetched record key in the getJobs message.
  6. Works on the second set of jobs.
  7. Continues the above process till all the present jobs are worked upon.
  8. After all the present jobs are worked upon, waits for a signal, i.e. the agent trigger, to start working again.
  9. Upon getting the signal to start, starts working again, i.e. follows Step 2 to Step 7.
More details on default agent behavior - 
Triggering an Agent is the act of posting a getJobs message to the JMS queue. Triggering may be manual or automatic i.e. self triggered. During an agent startup if there are no messages in the queue an agent automatically triggers itself.
Within the getJobs method, the agent tries to acquire a lock on the YFS_OBJECT_LOCK table for the agent criteria ID.
If the lock is not available, the getJobs method exits and does nothing. This ensures that duplicate sets of records are not retrieved for processing. 
If the lock is available, the getJobs method fetches the records that need to be processed. 
These records are posted as execute messages to the JMS queue. For each message, depending on the JMS session pooling setting, a new MQ session is created or borrowed to post the message, and then the session is closed or returned to the pool. This default behavior could change in an upcoming version as a result of the testing we undertook for one of our customers.
After the execute messages, one getJobs message is also posted with the last record key so as to facilitate retrieval of the next batch of messages.
Each thread of the agent picks up execute messages one by one and processes them. Multiple threads of the execute method can run concurrently. 
Once all the execute messages are consumed and only the getJobs message is left in the queue, the same agent thread uses the getJobs() method to process the getJobs message and continue the processing cycle.
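The cycle described above can be sketched as a toy, single-threaded simulation in Python. This is only an illustration of the message flow, not the product's implementation: an in-process deque stands in for the JMS queue, integer keys stand in for record keys such as a TaskQKey, the YFS_OBJECT_LOCK step and multi-threaded execution are left out, and the function name run_agent_cycle is my own.

```python
from collections import deque


def run_agent_cycle(pending_keys, buffer_size):
    """Simulate getJobs/executeJobs cycles over a collection of record keys.

    Returns (processed_keys, get_jobs_calls).
    """
    keys = sorted(pending_keys)
    # Triggering the agent = posting a getJobs message with no last key.
    queue = deque([("getJobs", None)])
    processed = []
    get_jobs_calls = 0

    while queue:
        kind, payload = queue.popleft()
        if kind == "executeJobs":
            # Stand-in for the real executeJobs work on one record.
            processed.append(payload)
            continue

        # getJobs: fetch the next batch after the last fetched key.
        get_jobs_calls += 1
        last_key = payload
        batch = [k for k in keys if last_key is None or k > last_key][:buffer_size]
        if not batch:
            break  # nothing left to do; wait for the next trigger

        # Post one execute message per record, then the next getJobs message
        # carrying the last fetched key.
        for key in batch:
            queue.append(("executeJobs", key))
        queue.append(("getJobs", batch[-1]))

    return processed, get_jobs_calls


if __name__ == "__main__":
    done, calls = run_agent_cycle(range(1, 13), buffer_size=5)
    print(len(done), "records processed in", calls, "getJobs calls")
```

Note that the final getJobs call finds nothing and simply ends the cycle, so 12 records with a buffer of 5 take three productive getJobs calls plus one empty one.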

Scalability concerns and Scaling the Availability Monitor agent -
Are Sterling agents multi-threaded?
Not entirely. The getJobs component of the agent is deliberately made single threaded via the database locking on YFS_OBJECT_LOCK, to ensure that the same set of records is not retrieved and processed multiple times. However, the bulk of the workload is in the executeJobs component, which is multi-threaded and can run in multiple JVMs.

Will my agent scale to meet the peak throughput?
Depends on your volumes. Scaling an agent involves tuning both the getJobs and the executeJobs components. The scaling and tuning of the executeJobs component is a different exercise that varies with the use case, so it will not be covered in this post. At low to medium volumes, under 100K jobs/hour, scalability issues are largely with the executeJobs component, and the default settings that govern agent behavior should work well. If you are using the agent framework to process over 150K "jobs" per hour, there may be challenges using the default implementation. I use the term jobs to denote the message entity, e.g. jobs in the case of ScheduleOrder are distinct Orders and for Availability Monitor they are distinct Inventory Items.

What are the elements that affect scaling beyond 100K jobs/hour?
  1. Performance of the getJobs query - The slower the query, the more time is spent retrieving messages.
  2. Time taken to write all of the retrieved executeJobs messages to the queue - The default behavior of creating and closing an MQ session for every individual message was a significant overhead. Using the product hot fix (HF) that enables bulk loading of messages significantly improves the write time per message. Other aspects, such as the persistence setting used for the queue and the network latency between the agent servers and the MQ server, can also affect message write times to the queue. 
  3. Buffer size of messages to get - The default of 5K may not suffice at very high loads, as it would mean 40 or more executions of the getJobs component to achieve just 200K throughput. Since getJobs is single threaded, the number of times it executes needs to be kept to an optimum.
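The buffer size arithmetic in the last point is easy to play with. Here is a small Python sketch (the function names are mine) that works out how many getJobs executions an hourly target needs and how much wall-clock time that leaves for each full getJobs-plus-execute cycle, assuming every batch is completely full.

```python
import math


def get_jobs_executions(jobs_per_hour, buffer_size):
    """Number of getJobs executions needed per hour, assuming full batches."""
    return math.ceil(jobs_per_hour / buffer_size)


def seconds_per_cycle(jobs_per_hour, buffer_size):
    """Average time budget per cycle (fetch, post and execute one batch)."""
    return 3600 / get_jobs_executions(jobs_per_hour, buffer_size)


if __name__ == "__main__":
    for buffer_size in (5000, 10000):
        runs = get_jobs_executions(200_000, buffer_size)
        budget = seconds_per_cycle(200_000, buffer_size)
        print(buffer_size, runs, budget)
```

At 200K jobs/hour the default buffer of 5000 gives 40 cycles of 90 seconds each, while a buffer of 10000 halves that to 20 cycles of 180 seconds.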

Scaling the Real Time Availability Monitor (RTAM) Agent - A case study
At a customer site one of the challenges was to scale the Real Time Availability Monitor agent to do the Partial Sync of inventory at over 250K records/hour. The customer was running Sterling 8.5 HF 25, WAS 7 and MQ 7. The following actions were taken to scale the agent from about 150K/hr to around 300K/hr -
1. Tuning the getJobs query - Front loading the YFS_INVENTORY_ACTIVITY table would heavily skew the test results due to the excessive time spent querying it as part of the getJobs query. Hence, trimming the Inventory Activity table or keeping its size in check significantly reduces the time taken for getJobs and also represents the production workload more realistically. We also ensured usage of the correct index and updated statistics.
2. Setting the agent queue to non-persistent - We defined the internal JMS queue as non-persistent on MQ, and then set the PER(QDEF) option in the scp file for this queue's entry while generating the bindings. Writing each message to the persistent queue took between 11 and 20 ms, whereas on the non-persistent queue it was under 5 ms.
3. Enabled JMS session pooling for this agent via the following property in customer_overrides.properties -
yfs.yfs.jms.session.disable.pooling=N
This allows sessions to be borrowed and returned to the pool instead of new ones getting created and closed for each message.
4. Enabling the bulk loader property for the agent framework to avoid creating and closing sessions for each message being posted to the queue. We worked with IBM Sterling support to accomplish this via 8.5 HF48. The below 2 properties were set in the customer_overrides.properties file 
yfs.agent.bulk.sender.enabled=Y
yfs.agent.bulk.sender.batch.size=50000
The batch size setting of 50000 should be equal to or greater than the maximum buffer size you plan to use across all agents.
5. Running the agents in the same data center as the MQ server - This further reduces the latency between the two tiers and therefore improves the overall performance. This may not always be possible if you are running agents in multiple data centers. 
6. Increasing the buffer size of records retrieved to 10000 from the default of 5000 - We tested with various settings between 5K and 25K and found that the overall performance was best at 10000 for our setup. The optimal setting for the buffer size may vary on your environment and workload so run performance tests to determine what works best for your needs.
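To see why the per-message write time matters so much, here is a back-of-the-envelope Python sketch. It assumes messages are written to the queue strictly one after another by the single getJobs thread and ignores the getJobs query and the executeJobs work entirely, so treat the results as rough upper bounds under those assumptions, not measurements. The function names are mine; the millisecond figures are the ones quoted in step 2.

```python
def batch_write_seconds(buffer_size, ms_per_message):
    """Time to post one full batch of execute messages, written serially."""
    return buffer_size * ms_per_message / 1000.0


def max_messages_per_hour(ms_per_message):
    """Ceiling on hourly throughput if message posting were the only cost."""
    return 3_600_000 / ms_per_message


if __name__ == "__main__":
    for label, ms in (("persistent, best", 11), ("persistent, worst", 20), ("non-persistent", 5)):
        print(label, batch_write_seconds(10_000, ms), round(max_messages_per_hour(ms)))
```

Under these assumptions a 10000-record batch takes 110 to 200 seconds to post at 11 to 20 ms per message against 50 seconds at 5 ms, and the 20 ms case caps out at 180K messages/hour before any real work is done.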

Now that you have a better understanding of how the Sterling agent works you should be in a better position to troubleshoot and scale the agents. Happy testing and tuning! 

Saturday, July 28, 2012

OMS Transaction Framework - Nomenclature and more


After a long break following my first post - longer thanks to the distractions of the Euro Cup and Wimbledon - I am back with a little tidbit on nomenclature related to the Sterling transaction framework. If you have been stumped by whether a Sterling process is an agent server or integration server then this info would help you make the right call. 

Over the many years I have worked with various customers, colleagues and partners, I have heard people use various Sterling terms - agent server, integration server, transaction - interchangeably. The Sterling OMS world is not what it was in 2000, and the lines are getting blurred as traditional "agent" processes are being implemented as services, so I figured I should tackle this topic in my blog. Earlier this week one of my colleagues mentioned that he had just explained this topic for the n-th time to a new customer and pinged me looking for such a write-up, so I figured that it was time to put pen to paper, or rather finger to keyboard. (For illustrations do refer to the Sterling product documentation guides - ftp://public.dhe.ibm.com/software/commerce/doc/ssfs/85/Application_Platform_Configuration_Guide.pdf)

Grab a cup of your favorite beverage as this post does get a little long..

Transactions - In software parlance, a transaction usually means a sequence of information exchange and related work (such as database updating) that is treated as a unit for the purposes of satisfying a request while ensuring data integrity. Transactions may be synchronous such as those running in the UI or Asynchronous such as the batch jobs.  In the Sterling world the product extensibility and flexibility make the boundaries of a seemingly similar transaction vary from implementation to implementation even if the project teams and customers may call it the same such as Create Order Transaction or even dropping the transaction and referring to it as Create Order. In Sterling these transactions are defined either as an agent criteria or a Service via the Application Configurator. These transactions are executed in background JVMs known as agent servers or integration servers (also referred to as batch jobs) or directly from the Sterling UI (traditional console or thick client) or a Webservice call from external systems on the Application server JVM. Transactions consist of the underlying API and its associated events, user exits and conditions. A successful transaction results in the changes being committed that usually involves a combination of Database updates and messages being written and read from a queue or file. Either the entire transaction is successful or an error is thrown which causes the entire transaction to be rolled back or an error to be raised for subsequent reprocessing.  

In Sterling MCF we can classify processes into the following types of transactions :-
1. Time-Triggered transactions or Agents - These are triggered on a scheduled basis to perform repetitive actions. Actions typically include invoking APIs to perform database updates. E.g. consolidation of orders to shipments may need to happen around 30 minutes apart, so the Consolidate To Shipment time-triggered transaction can be configured to trigger every 30 minutes. Most time-triggered transactions are driven by records in the YFS_TASK_Q table or are based on the pipeline. Time-triggered transactions are defined by the Transaction Name and the Agent Criteria. They can be run in single or multi thread mode. They are also called agents, and the servers in which they run are called agent servers. The four types of time-triggered transactions are :-
i. Business Process transactions - Responsible for majority of processing entities such as orders (sales/purchase/transfer) and shipments. The entities in every implementation will require one or more business process transactions such as CONSOLIDATE_TO_SHIPMENT, CLOSE_ORDER to complete their lifecycle. Understanding limitations of the Sterling transaction framework and designing for your business needs can help you get the most out of the solution.
ii. Purge transactions - Archive data from live (transaction) tables to history tables, or delete data that does not require archiving. They help to mitigate unrestricted growth of the OMS transactional database. Frequently underestimated in value and in development and testing effort, and overlooked in most implementations, leading to application performance issues.
iii. Task Q Syncher Time-Triggered Transactions – A relatively new addition to the fold, used to update the task queue repository table with the latest list of open tasks to be performed by each corresponding transaction, based on the latest pipeline configuration. Four of these transactions are available - Load Execution, Order Fulfillment, Order Delivery and Order Negotiation.
iv. Monitors – These are circumstance driven transactions that watch for processes or circumstances that are out of bounds and then raise alerts. Common monitors are those for Order, Shipment, Inventory Availability and Exceptions. Monitoring jobs can be a huge system hog if the data is not being purged often and if excessive stale entities exist such as abandoned or erroneous orders.

2. Services or Flows – Transactions that are not executed at pre-defined times are called services or flows. In the database and in configuration screen titles this name is also used for every transaction in the SDF. Services can be invoked via broadly available transports - Web service/SOAP, HTTP, JMS, MSMQ, DB, flat file etc. A service can invoke other services to make a longer chain of services. A service could include invoking APIs (product or custom), evaluating conditions, making DB updates etc. Services are processed continuously, subject to thread and resource availability, and are not triggered at any particular time. They can be run in single or multi thread mode, and the servers in which they are executed are called integration servers. The most common scenario is the use of services to read messages from an inbound queue to Sterling, for example to create orders flowing in from a web channel.

3. Externally-Triggered Transactions - An externally-triggered transaction is used to map a service invoked to a Sterling transaction and to leverage the transaction framework. Seldom used in the real world as implementations prefer to just use a service/flow minus the transaction instead.

4. User-Triggered Transactions - A user-triggered transaction is invoked manually through the Application Consoles, a configured alert queue, or an e-mail service.  Never seen it used in the field.  So if you are implementing this or the externally-triggered do let me know how it goes.

Composite services – A construct to enable invocation of multiple services in parallel. A very useful concept ever since its addition to the SDF but needs careful testing as implementations could run into issues stemming from funky exception handling or inadequate logging.

Agent Criteria – An element that describes attributes that are specific to a time-triggered transaction. These attributes include the selector criteria such as Organization code, Manual or Auto triggered, trigger interval and server name. A particular transaction may have one or more agent criteria for processing data for different organizations or other logical grouping. E.g. Schedule Order agent criteria could be used to run scheduling for different organizations at different intervals.

Agent Server - Server JVM on which one or more agent criteria (commonly referred to as agents) can run. Invokes the com.yantra.integration.adapter.IntegrationAdapter class and is started typically by a startIntegrationServer.sh script provided as part of the product installation.

Integration Server - Server JVM on which one or more integration services or flows (commonly referred to as services or mistakenly called agents) are run. Invokes the com.yantra.integration.adapter.IntegrationAdapter class and is started by a startIntegrationServer.sh. 
Yes, you read it right! Both agents and integration services are started by the same class and script, but the server name, the service name or agent criteria name, and their definitions control the behavior.

Trigger agent - This is the process that is typically invoked via Cron or Ctrl-M jobs to trigger a certain time-triggered transaction at certain points in time, using the triggeragent.sh or triggeragent.cmd script. E.g. to create waves at certain hours of the day in a WMS implementation, the trigger agent job could be invoked to trigger the Create Wave agent, or to run a nightly purge of sales orders we could trigger the Order Purge agent.

Events – Help accomplish specific actions that are executed when a certain business event occurs. E.g. ON_SUCCESS of Create Shipment we could have an event to send an e-mail to the customer with the shipment details, or ON_BACKORDER of Schedule Order could be used to raise an alert to the Inventory Control business team. Event handlers are configured to associate the required actions with a particular event. Conditions are often used to further customize the action taken. Event handlers can invoke any service to e-mail or raise an exception alert, publish XML to external queues/databases, or invoke custom services. The associated actions are triggered every time the event is raised, where applicable, so use them with caution. An excessive number of actions, or overly complicated ones, can prolong a transaction, so use them wisely and tune them well.

User Exits – These enable transactions to invoke custom logic to interact with external systems synchronously to complete processing. A classic example is in the Payment Agent for credit card authorization. They are frequently a source of issues when not implemented well, and only care while designing and testing can avoid myriad issues post production. 


Tuesday, April 17, 2012

Whose problem is it anyway?


Growing up in India cable TV did not make it to my home until my high school days. One of the early shows that caught my attention was the very funny syndicated improvisational game show – "Whose line is it anyway?"  In the eponymous round contestants on their turn use their creative instincts and quick-wittedness to “explain or demo” random and quirky looking props. The toughest OMS problems call for that same kind of creativity, (although the results or the scenario itself is far from being funny) and for someone to step up and make sense of the problem (random or otherwise) with a complex software solution that has a seemingly quirky side to it.

When a MQ Queue Full is not an MQ issue –
Here’s a typical problem encountered at a Sterling OMS implementation in the testing phase. A certain transaction say CREATE_ORDER fails with the following exception and stack trace –
com.yantra.interop.services.jms.JMSProducer$RetryException: com.ibm.msg.client.jms.DetailedInvalidDestinationException: JMSWMQ2007: Failed to send a message to destination 'CREATE_ORDER_QUEUE'. JMS attempted to perform an MQPUT or MQPUT1; however WebSphere MQ reported an error. Use the linked exception to determine the cause of this error.
        at com.yantra.interop.services.jms.JMSProducer.sendJMSMessage(JMSProducer.java:852)
       at com.yantra.interop.services.jms.JMSProducer.access$700(JMSProducer.java:63)
......
JMSCMQ0001: WebSphere MQ call failed with compcode '2' ('MQCC_FAILED') reason '2053' ('MQRC_Q_FULL'). [system]: JMSProducer
com.ibm.mq.MQException: JMSCMQ0001: WebSphere MQ call failed with compcode '2' ('MQCC_FAILED') reason '2053' ('MQRC_Q_FULL').

At first glance, this seems to be an MQ issue calling for the testing team to make a beeline to the WebSphere MQ administrator’s desk. However, a more thorough investigation calls for many additional checks to be done and questions to be answered before pinging the MQ Admin.
a. Has the queue been sized appropriately for the environment?
b. Are there processes – Sterling or otherwise – attached to and consuming messages from the queue?
c. Are the messages from the queue being consumed at a much slower rate than incoming messages? 
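Questions (a) and (c) boil down to simple arithmetic once you know the queue's maximum depth and the rates at which messages arrive and are consumed. Here is a minimal Python sketch of that check. The function name and all the numbers are hypothetical; plug in the values from your own queue definition and monitoring.

```python
def seconds_until_full(max_depth, current_depth, put_rate, get_rate):
    """Seconds until the queue reaches max_depth, or None if it never fills.

    put_rate and get_rate are messages per second written to and consumed
    from the queue.
    """
    net_rate = put_rate - get_rate
    if net_rate <= 0:
        return None  # consumers keep up, so the depth does not grow
    return (max_depth - current_depth) / net_rate


if __name__ == "__main__":
    # Hypothetical queue: room for 5000 messages, 50 msg/s in, 30 msg/s out.
    print(seconds_until_full(5000, 0, put_rate=50, get_rate=30))
    # No consumer attached at all (question b): fills in max_depth / put_rate.
    print(seconds_until_full(5000, 0, put_rate=50, get_rate=0))
```

If the answer is a few minutes in a load test, the queue full error says more about queue sizing or consumer throughput than about MQ itself.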

Other Sterling OMS system and performance problems would entail weeding through many more questions such as
a. Is it a browser issue?
b. Is it a database tuning issue?
c. Is it an Appserver configuration problem?
d. Does the solution/product scale to meet our needs? 

Failure to consider all these questions to identify a root cause often leads to the conclusion that most Sterling OMS system problems are simply “a Sterling issue” (the industry is yet to term this an IBM issue, perhaps reserving that for their other woes with “traditional” products on the IBM tech stack). Whose problem is that anyway? Or to be more precise, between the implementation team (developers and testers), the system admin team (DBAs, Appserver, JMS and AIX admins) and IBM Support, who is going to own it and drive it to resolution? Thus was born the role of a services-focused Yantra/Sterling Performance Engineer in 2004 (Yantra being the name it was known by up until the Sterling Commerce acquisition in 2005). The name Performance Engineer or PE has stuck, not because all issues require performance tuning but because nothing else fitted either. 

How Performance Engineers are like Economists –
Steven Levitt in his best-seller SuperFreakonomics describes economists as being trained to be cold-blooded to calmly discuss trade-offs involved in a global catastrophe while the rest of us non-economists are a bit more excitable. A good Performance Engineer (Sterling or otherwise) is a lot like that economist and although he is not called on to explain implications of a global catastrophe like an earthquake or global-warming (a production outage being the biggest catastrophe that a PE is called on to solve) he needs to analyze issues calmly and keep emotions – blame, paralysis, confusion, panic, ego –  in check while collaborating with the various teams - business users, System Administrators, developers and Support to find a resolution. 

Had an economist been regarded as highly as a doctor or an engineer in the Indian middle class psyche, perhaps I may have gone on to become one. Now, 8 years since I first started as an in-house PE in the QA organization and 12 years since I started there as a Support Engineer, I am still solving Sterling issues and still loving it. This blog attempts to share what I have learnt over the years (and am still learning) about implementing, fixing and tuning Sterling applications. Although it may be difficult to explore all Sterling issues in a simple Q & A format like that of the asktom site hosted by the legendary Tom Kyte (the first “technology” guru I was and still am in awe of), I shall experiment to see what can best be shared in this format. I am hoping that I can review your questions, try to answer some (or at least the most interesting and relevant ones), cover other topics in these pages and, most importantly, nurture the inner "PE" in each of you.  

Do let me know your comments on this post & format and what Sterling topics you want to see covered (It will keep me from boring you with personal stories and not-particularly-useful insights).