At my current job, one of our biggest customers raised an issue they were facing with our notification service. They were not receiving any notifications from our system on their new infrastructure, while their production infrastructure was working fine.
I reviewed their configuration and everything looked fine. At the same time, we hadn't made any changes on our end that could have caused this issue. Since their contract was up for renewal, we needed a quick resolution. I set up a live debugging session with the customer to understand the issue and their new infrastructure in detail. After the first call it was clear that the issue was happening during the HTTPS handshake or at the network layer. To keep the momentum going, I set up a recurring call with them and assured them that we were treating this as a high priority. On the next call I involved our security expert and SRE team for further debugging. However, no further progress was made. At this point I insisted to the SRE team that we contact Azure support to find out whether there was an issue at the TCP layer. After debug logs were turned on, the Azure team reviewed them and found that our network's default MTU (maximum transmission unit) was higher than the customer's network MTU, and the customer's network was dropping packets.
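To illustrate the MTU mismatch described above, here is a minimal Python sketch. The MTU values (1500 and 1400) and the function names are hypothetical; the actual numbers from the incident are not recorded here. It assumes IPv4 and TCP headers without options (20 bytes each). It shows why a sender with a larger MTU can emit packets that a path with a smaller MTU cannot carry.

```python
IPV4_HEADER_BYTES = 20  # minimum IPv4 header, no options (assumption)
TCP_HEADER_BYTES = 20   # minimum TCP header, no options (assumption)


def tcp_mss(mtu: int) -> int:
    """Largest TCP payload that fits in a single packet of the given MTU."""
    return mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES


def fits_path(sender_mtu: int, path_mtu: int) -> bool:
    """True if a full-size packet from the sender fits through the path."""
    return sender_mtu <= path_mtu


# Hypothetical values for illustration only.
our_mtu = 1500
customer_mtu = 1400

print("our MSS:", tcp_mss(our_mtu))
print("customer MSS:", tcp_mss(customer_mtu))
print("full-size packets fit:", fits_path(our_mtu, customer_mtu))
print("fit after lowering our MTU:", fits_path(customer_mtu, customer_mtu))
```

A mismatch like this can be hard to spot because small packets still get through. Only full-size packets are affected, which may be why the failure appeared during the handshake rather than at connection setup.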
Finally, I worked with the SRE team to lower our MTU to match the customer's configuration and ran tests. After the config change the issue was resolved, and our customer confirmed that they could see the notifications. This was a really difficult issue to run into, but I worked with various teams to get to the bottom of it and delivered results to the customer's satisfaction.