Sunday, October 7, 2012

When seeing is not believing - Agent flow misconfiguration unraveled



The old saying goes "seeing is believing", but the other day, while examining an issue in a customer environment, I saw something that made me do a double take: I could not believe what the Sterling application configuration was showing me. Hence the title of this post (not to mention my weakness for catchy titles). Read on to find out how the issue was investigated and to learn more about agent/flow configuration internals.

Like most issues, it started out mundane: an Invalid Server error in one of the agent logs. The relevant lines from the logs of the agent server AsyncReqAgentServer are pasted below -


<Errors>
    <Error ErrorCode="YCP0223" ErrorDescription="Invalid Server." ErrorRelatedMoreInfo="No Services Configured for this Server: AsyncReqAgentServer">
        <Attribute Name="ErrorCode" Value="YCP0223"/>
        <Attribute Name="ErrorDescription" Value="Invalid Server."/>
        <Attribute Name="ErrorRelatedMoreInfo" Value="No Services Configured for this Server: AsyncReqAgentServer"/>
        <Stack>com.yantra.interop.services.InvalidConfigurationException


The AsyncReqAgentServer is typically used to run the ASYNC_REQ_PROCESSOR transaction. So I did what most of us would do: I checked the configuration of the ASYNC_REQ_PROCESSOR transaction. Here is what I saw -


Now you can see what stumped me. On one hand, the application configuration shows the transaction configured under AsyncReqAgentServer, while the logs of the same application vehemently insist that no services are configured for that server. Putting on my PE hat, I figured that there was more to it than meets the eye and decided to dig a little deeper.

First, I checked whether the transaction was running at all. A quick grep of the agent logs showed that it was running as part of the DefaultAgentServer, since it was the DefaultAgentServer logs that contained the "Starting service..." message.
Then I checked the other environments to see where it was supposed to be running or configured. In Production it was running under the AsyncReqAgentServer. The lower environments were mixed, but in most of them it was running on the DefaultAgentServer.
At this stage, a combination of instinct and experience led me to guess that the configuration was probably right in Production and messed up here and elsewhere. I just had to prove it.
So I checked the server configuration instead of the transaction configuration. This is a neat little configuration screen that is not very well known, mostly because it is seldom used. Buried in the Platform Application view > System Administration grouping is the Configured Servers view. It can be used to view all the defined servers as well as the sub services or agent criteria configured for each of them. Here is a screenshot -



The sub service list tab shown is accessed by double-clicking an individual server to view its details. Here is what it showed for the AsyncReqAgentServer -

and for the DefaultAgentServer - 

So now it was clear that the logs were correct (at least in this scenario): ASYNC_REQ_PROCESSOR was indeed running as part of the DefaultAgentServer, and the AsyncReqAgentServer had no services configured. It was the configuration that was out of whack, with the Server configuration and the Transaction configuration disagreeing with each other. That mystery unravels further if one digs into how these views are displayed and how configuration data is propagated.

The Transaction configuration view is based on the YFS_FLOW and YFS_SUB_FLOW tables, whereas the Server configuration view and its associated sub services are built on the YFS_SERVER and YFS_AGENT_CRITERIA tables. Normally these config tables stay in sync, as long as all configuration changes are made manually through the application. However, in most implementations the Master Config (MC) environment is maintained as the source of config changes, and CDT is used to promote those changes to the various environments. A problem in the MC environment, typically a crash or an incorrect data fix, can leave the two sets of tables inconsistent. This mis-configuration is then promoted to other environments via CDT. Production was spared because it was running an older version of the release and the config changes were yet to be promoted there.

Here is a query that I could have used to confirm my observations   - 

select agent_criteria_id, transaction_key, flow_key, server_key from yfs_agent_criteria
where server_key in (select server_key from yfs_server where server_name = 'AsyncReqAgentServer')
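
The same idea can be stretched into a cross-check between the two sets of tables. What follows is only a rough sketch, not something I ran at the customer: it builds toy copies of the tables in an in-memory SQLite database so that the query can be tried anywhere. It assumes that yfs_sub_flow carries flow_key and server_key columns that can be joined to yfs_agent_criteria; verify that against your own schema before relying on it. All key values and the second agent in the sample data are made up for illustration.

```python
import sqlite3

# Toy stand-ins for the Sterling config tables, with only the columns used here.
# ASSUMPTION: yfs_sub_flow has flow_key and server_key columns; check your schema.
SCHEMA = """
create table yfs_server (server_key text primary key, server_name text);
create table yfs_sub_flow (sub_flow_key text primary key, flow_key text, server_key text);
create table yfs_agent_criteria (agent_criteria_id text, transaction_key text,
                                 flow_key text, server_key text);
"""

# List agent criteria whose server differs from the server on the matching sub flow.
MISMATCH_SQL = """
select ac.agent_criteria_id,
       flow_srv.server_name  as transaction_view_server,
       agent_srv.server_name as server_view_server
from yfs_agent_criteria ac
join yfs_sub_flow sf      on sf.flow_key = ac.flow_key
join yfs_server flow_srv  on flow_srv.server_key = sf.server_key
join yfs_server agent_srv on agent_srv.server_key = ac.server_key
where sf.server_key <> ac.server_key
order by ac.agent_criteria_id
"""


def find_mismatches(conn):
    """Return (criteria id, server per sub flow, server per agent criteria) rows."""
    return conn.execute(MISMATCH_SQL).fetchall()


def build_demo_db():
    """Create an in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "insert into yfs_server values (?, ?)",
        [("S1", "AsyncReqAgentServer"), ("S2", "DefaultAgentServer")],
    )
    # Inconsistent pair: the sub flow points at S1, the agent criteria at S2.
    conn.execute("insert into yfs_sub_flow values ('SF1', 'F1', 'S1')")
    conn.execute(
        "insert into yfs_agent_criteria values ('ASYNC_REQ_PROCESSOR', 'T1', 'F1', 'S2')"
    )
    # Consistent pair: both tables point at S2, so it is not reported.
    conn.execute("insert into yfs_sub_flow values ('SF2', 'F2', 'S2')")
    conn.execute(
        "insert into yfs_agent_criteria values ('OTHER_AGENT', 'T2', 'F2', 'S2')"
    )
    return conn


if __name__ == "__main__":
    for row in find_mismatches(build_demo_db()):
        print(row)
```

Against a real Sterling database only the select statement matters; the rest is scaffolding to make the sketch self-contained.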

The query on yfs_agent_criteria can be adapted to your situation, for example to list all the services configured under a particular server. So when it comes to Sterling OMS (and perhaps most things in life), if you don't believe what you see, just look further.