July 28, 2007

The Case Of The Differing Environments

I woke up feeling pretty good about myself. The new server was supposed to go live the evening before, and it had passed all of its production tests before I went to bed. I'd even set my BlackBerry on loud so that if anything did go wrong, it would wake me, my wife or my granddaughter and the problem could be taken care of. The red blinking light and the vibrating BlackBerry told me that there seemed to be a difference of opinion between me and RIM about what "loud" meant.

"Half of our credit card transactions are getting denied and we don't know why." Needless to say, that made me panic a bit. That portion of the server had been tested non-stop for nearly a month. It was the first part of the server to get completed, so it not working really threw me for a loop.

I VPN'ed into the server and looked at the audit logs. Sure enough, there was some goofy error code there that I had never seen before. I logged into our credit card processor's site and saw that the code was an Address Verification System failure. I went back to my code and saw that I had a check in place for AVS failures that had passed unit tests against their test system, but it was looking for a different error code.

Now, these people's cards would have been denied anyway, but they were getting the incorrect message. I was abstracting the processor code behind an enum of my own, so instead of code 101, they'd be getting CreditCardResult.DeniedMissingInfo and instead of code 204, they'd be getting back CreditCardResult.DeniedOverLimit. Any code that I didn't recognize that was a failure would come back as CreditCardResult.DeniedOther. The number of things that could result in DeniedOther was fairly small, but because the code I was getting back differed between test and production, AVS failures were returning DeniedOther instead of DeniedAVS.
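For anyone who wants to picture that abstraction, here is a rough sketch in Python. It is not the actual server code. The enum member names and codes 101 and 204 come from the description above; the lookup table, the function name map_failure_code and the choice of 203 as the AVS code the wrapper knew about are assumptions made for illustration.

```python
from enum import Enum, auto


class CreditCardResult(Enum):
    DeniedMissingInfo = auto()
    DeniedOverLimit = auto()
    DeniedAVS = auto()
    DeniedOther = auto()


# Failure codes the wrapper recognizes. 101 and 204 are from the post;
# using 203 as the known AVS code is an assumption for this sketch.
KNOWN_FAILURES = {
    101: CreditCardResult.DeniedMissingInfo,
    204: CreditCardResult.DeniedOverLimit,
    203: CreditCardResult.DeniedAVS,
}


def map_failure_code(code: int) -> CreditCardResult:
    """Translate a processor failure code; unrecognized codes become DeniedOther."""
    return KNOWN_FAILURES.get(code, CreditCardResult.DeniedOther)
```

With a table like this, an AVS failure that arrives under a code the table has never seen (say 200 instead of 203) quietly falls through to DeniedOther, which is the symptom described above.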

Others were showing different errors. The bad card number error code was being used for missing card information, and other things weren't matching up either.

I spent a few minutes looking to see which codes were being returned for which values and I was rather disturbed by the change. Did our tests pass because we were using values under $25? How did our test environment differ from our production environment?

After digging a bit deeper, I found the reason. Our test environment on the processor server side had been using GPN on the back end instead of Global Collect. For them, AVS errors were code 203 instead of code 200. All of the failure codes varied slightly as a result.

A quick bit of code to detect if we were in production or test and a quick mapping table for the correct codes to the proper environment, and all would be good again once the next production build was pushed live.
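A minimal sketch of that kind of fix, in Python, might look like the following. It is an illustration, not the code that shipped. The APP_ENV variable, the function names and the table layout are hypothetical, and which environment uses 203 versus 200 for AVS is assumed here.

```python
import os
from enum import Enum, auto


class CreditCardResult(Enum):
    DeniedAVS = auto()
    DeniedOther = auto()


# One failure-code table per environment. Only the AVS codes (203 and 200)
# appear in the post; assigning 203 to test and 200 to production is an
# assumption for this sketch.
FAILURE_CODES = {
    "test": {203: CreditCardResult.DeniedAVS},
    "production": {200: CreditCardResult.DeniedAVS},
}


def current_environment() -> str:
    """Read the environment from APP_ENV (a hypothetical variable), defaulting to test."""
    env = os.environ.get("APP_ENV", "test")
    return env if env in FAILURE_CODES else "test"


def map_failure_code(code: int, environment: str) -> CreditCardResult:
    """Look the code up in the table for the given environment."""
    return FAILURE_CODES[environment].get(code, CreditCardResult.DeniedOther)
```

A caller would then write something like map_failure_code(code, current_environment()), so the same raw code can mean different things depending on which back end answered.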

And now onto the next mystery...
