Tuesday, April 17, 2012

Whose problem is it anyway?


When I was growing up in India, cable TV did not make it to my home until my high school days. One of the early shows that caught my attention was the very funny syndicated improvisational game show "Whose Line Is It Anyway?". In one of its rounds, contestants take turns using their creative instincts and quick wit to “explain or demo” random and quirky-looking props. The toughest OMS problems call for that same kind of creativity (although neither the scenario nor the results are funny), and for someone to step up and make sense of a problem, random or otherwise, in a complex software solution that has a seemingly quirky side to it.

When an MQ Queue Full is not an MQ issue –
Here’s a typical problem encountered at a Sterling OMS implementation in the testing phase. A certain transaction, say CREATE_ORDER, fails with the following exception and stack trace:
com.yantra.interop.services.jms.JMSProducer$RetryException: com.ibm.msg.client.jms.DetailedInvalidDestinationException: JMSWMQ2007: Failed to send a message to destination 'CREATE_ORDER_QUEUE'. JMS attempted to perform an MQPUT or MQPUT1; however WebSphere MQ reported an error. Use the linked exception to determine the cause of this error.
        at com.yantra.interop.services.jms.JMSProducer.sendJMSMessage(JMSProducer.java:852)
       at com.yantra.interop.services.jms.JMSProducer.access$700(JMSProducer.java:63)
......
JMSCMQ0001: WebSphere MQ call failed with compcode '2' ('MQCC_FAILED') reason '2053' ('MQRC_Q_FULL'). [system]: JMSProducer
com.ibm.mq.MQException: JMSCMQ0001: WebSphere MQ call failed with compcode '2' ('MQCC_FAILED') reason '2053' ('MQRC_Q_FULL').

At first glance, this looks like an MQ issue, and the testing team is tempted to make a beeline for the WebSphere MQ administrator’s desk. A more thorough investigation, however, means running several checks and answering several questions before pinging the MQ admin:
a. Has the queue been sized appropriately for the environment?
b. Are there processes, Sterling or otherwise, attached to the queue and consuming messages from it?
c. Are messages being consumed from the queue at a much slower rate than they arrive?
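
The three checks above can be sketched as a small triage routine. This is only an illustrative sketch, not a Sterling or MQ API: the `QueueSnapshot` class, its field names, the finding labels and the `expected_backlog` parameter are all hypothetical. The figures would have to come from the MQ administrator or a monitoring tool; in WebSphere MQ the first three correspond to the queue attributes CURDEPTH, MAXDEPTH and IPPROCS, while the per-minute rates are assumed to be measured separately.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueueSnapshot:
    """Figures for one queue over a sampling window (hypothetical names)."""
    current_depth: int       # messages on the queue now (MQ: CURDEPTH)
    max_depth: int           # configured capacity (MQ: MAXDEPTH)
    consumers: int           # handles open for input (MQ: IPPROCS)
    enqueued_per_min: float  # arrival rate, assumed measured elsewhere
    dequeued_per_min: float  # consumption rate, assumed measured elsewhere


def triage_queue_full(q: QueueSnapshot, expected_backlog: int) -> List[str]:
    """Return the likely contributors to a queue-full error, in check order."""
    findings = []
    # (a) Sizing: can the queue hold the backlog this environment expects?
    if q.max_depth < expected_backlog:
        findings.append("undersized")
    # (b) Consumers: is anything attached and reading from the queue?
    if q.consumers == 0:
        findings.append("no_consumer")
    # (c) Rate: consumers exist but drain more slowly than messages arrive.
    elif q.dequeued_per_min < q.enqueued_per_min:
        findings.append("slow_consumer")
    return findings


# Example: a full queue with no integration server listening on it.
snapshot = QueueSnapshot(current_depth=5000, max_depth=5000, consumers=0,
                         enqueued_per_min=200.0, dequeued_per_min=0.0)
print(triage_queue_full(snapshot, expected_backlog=20000))
```

Only the first finding is really a conversation for the MQ admin; the other two usually point back at whichever process was supposed to be consuming from the queue.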

Other Sterling OMS system and performance problems entail weeding through many more questions, such as:
a. Is it a browser issue?
b. Is it a database tuning issue?
c. Is it an Appserver configuration problem?
d. Does the solution/product scale to meet our needs?

Failing to work through these questions to find a root cause often leads to the conclusion that most Sterling OMS system problems are simply “a Sterling issue” (the industry has yet to call this “an IBM issue”, perhaps reserving that term for its woes with the “traditional” products on the IBM tech stack). Whose problem is that anyway? Or, to be more precise: among the implementation team (developers and testers), the system admin team (DBAs and Appserver, JMS and AIX admins) and IBM Support, who is going to own the problem and drive it to resolution? Thus was born, in 2004, the role of the services-focused Yantra/Sterling Performance Engineer (the product was known as Yantra until the Sterling Commerce acquisition in 2005). The name Performance Engineer, or PE, has stuck, not because every issue calls for performance tuning but because nothing else fit any better.

How Performance Engineers are like Economists –
Steven Levitt, in his best-seller SuperFreakonomics, describes economists as trained to be cold-blooded enough to calmly discuss the trade-offs involved in a global catastrophe, while the rest of us non-economists are a bit more excitable. A good Performance Engineer (Sterling or otherwise) is a lot like that economist. He is not called on to explain the implications of a global catastrophe such as an earthquake or global warming (a production outage is the biggest catastrophe a PE is called on to solve), but he does need to analyze issues calmly and keep emotions (blame, paralysis, confusion, panic, ego) in check while collaborating with the various teams (business users, system administrators, developers and Support) to find a resolution.

Had an economist been regarded as highly as a doctor or an engineer in the Indian middle-class psyche, perhaps I would have gone on to become one. Now, 8 years since I first started as an in-house PE in the QA organization and 12 years since I started there as a Support Engineer, I am still solving Sterling issues and still loving it. This blog attempts to share what I have learnt (and am still learning) over the years about implementing, fixing and tuning Sterling applications. It may be difficult to explore all Sterling issues in a simple Q & A format like that of the asktom site hosted by the legendary Tom Kyte (the first “technology” guru I was, and still am, in awe of), but I shall experiment to see what can best be shared in this format. I hope to review your questions, answer some of them (or at least the most interesting and relevant ones), cover other topics in these pages and, most importantly, nurture the inner "PE" in each of you.

Do let me know your comments on this post and its format, and which Sterling topics you want to see covered (it will keep me from boring you with personal stories and not-particularly-useful insights).