As an EM, it is part of our day-to-day job to wear multiple hats and to speak more than one "language" when interacting with stakeholders. Let me illustrate with an example of a situation where I needed to explain a complex technical problem to a non-technical audience.
Situation: At Hyperface, an on-call issue occurred in which the availability of our reward engine started dropping. As a result, it could not process real-time rewards, and customers were affected.
1. We needed to identify the root cause.
2. The business team got involved because customers were calling them and asking for an RCA. They wanted us to present the root cause so that they could understand the problem and relay it to the bank's business team.
3. We identified the problem: for some transactions, the date arrived in a format different from the one agreed with the bank. Our code converted the date to epoch time, but the unparseable date came through as null, and the null was treated as epoch 0 (1 January 1970). The query therefore tried to fetch all transactions since 1970, which was killing the DB.
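To make the root cause concrete, here is a minimal sketch of how this kind of failure can happen. It is not our production code: the function names, the `%Y-%m-%d` format, and the fallback-to-zero behaviour are assumptions chosen only to illustrate the pattern.

```python
from datetime import datetime, timezone

# Assumed agreed-upon format; the real contract with the bank may differ.
EXPECTED_FORMAT = "%Y-%m-%d"
SECONDS_PER_DAY = 86_400


def to_epoch_lenient(raw):
    """Hypothetical buggy conversion: an unparseable date silently becomes 0."""
    try:
        parsed = datetime.strptime(raw, EXPECTED_FORMAT).replace(tzinfo=timezone.utc)
    except (ValueError, TypeError):
        parsed = None  # wrong format or missing value ends up as null
    # Null falls back to epoch 0, i.e. 1 January 1970.
    return int(parsed.timestamp()) if parsed is not None else 0


def query_window_days(from_raw, to_raw):
    """Number of days of transactions the range query would scan."""
    return (to_epoch_lenient(to_raw) - to_epoch_lenient(from_raw)) // SECONDS_PER_DAY


# A well-formed "from" date gives a one-day window.
print(query_window_days("2024-03-01", "2024-03-02"))
# A "from" date in an unexpected format widens the window back to 1970.
print(query_window_days("01/03/2024", "2024-03-02"))
```

The point of the sketch is that nothing fails loudly: the bad input produces a valid-looking but enormous range, and the cost shows up only at the database.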
1. As the first step, we fixed the code and brought everything back to normal.
2. To explain the issue to the business team, we created an RCA document that followed a recursive "why" approach:
a. Why did this problem occur?
The problem occurred because our reward processing engine was going down.
b. Why was the reward engine going down?
The database load had increased, so services could not communicate with the DB and were dying.
c. Why did the DB load increase?
The DB load increased because the date in some transactions was in the wrong format, which produced wrong from and to dates, so the query tried to fetch a large number of transactions.
d. Why was the wrong date format not handled?
It was not handled because we had agreed with the bank's business team that the date format would never change and, because of fast-paced development, we had not added a specific check.
e. Why were other types of rewards affected?
We run a monolithic architecture in which all reward types use the same database. The load on that database was high, so all rewards were failing.
Steps to make sure this does not happen in the future:
1. Validate all fields; if validation fails, move the record to a DLQ and raise an alert.
2. Add a check on range queries: if the range for a user is more than 1 year, fail the query and move the record to the DLQ.
3. Move to a push-based architecture rather than a pull-based one.
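The first two prevention steps can be sketched roughly as follows. This is an illustration under assumptions, not our actual implementation: `DeadLetterQueue`, `build_query_range`, the `from_date`/`to_date` field names, and the date format are all hypothetical, and the in-memory lists stand in for a real queue and alerting system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

EXPECTED_FORMAT = "%Y-%m-%d"      # assumed agreed-upon format
MAX_RANGE = timedelta(days=365)   # the "more than 1 year" guard


@dataclass
class DeadLetterQueue:
    """In-memory stand-in for a real DLQ plus alerting hook."""
    records: list = field(default_factory=list)
    alerts: list = field(default_factory=list)

    def push(self, record, reason):
        self.records.append((record, reason))
        self.alerts.append(reason)  # a real system would page or notify here


def parse_date(raw):
    # Raises ValueError or TypeError instead of silently returning null.
    return datetime.strptime(raw, EXPECTED_FORMAT).replace(tzinfo=timezone.utc)


def build_query_range(record, dlq):
    """Return (start, end) for a valid record, or None after dead-lettering it."""
    try:
        start = parse_date(record["from_date"])
        end = parse_date(record["to_date"])
    except (KeyError, TypeError, ValueError) as exc:
        dlq.push(record, f"invalid date field: {exc}")
        return None
    if end < start or end - start > MAX_RANGE:
        dlq.push(record, "date range out of bounds")
        return None
    return start, end


dlq = DeadLetterQueue()
ok = build_query_range({"from_date": "2024-03-01", "to_date": "2024-03-02"}, dlq)
bad_format = build_query_range({"from_date": "01/03/2024", "to_date": "2024-03-02"}, dlq)
too_wide = build_query_range({"from_date": "1970-01-01", "to_date": "2024-03-02"}, dlq)
print(ok is not None, bad_format, too_wide, len(dlq.records))
```

The design choice here is to fail closed: a record that cannot be validated never reaches the database, and it stays available in the DLQ for inspection and replay once the upstream format is fixed.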
1. The business team understood the problem, which gave them confidence. They then communicated to the external team that the problem originated on the bank's end and would not happen again.
2. They followed up with the bank team and got the format corrected.