https://twitter.com/takekazuomi/status/1124507499501985793
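The RCA for yesterday's Azure outage is out. A change process went wrong, leaving one of the four name servers with zone data that contained no records, so it started returning nxdomain. As a result, about 25% of queries for names such as http://database.windows.net failed, which was the cause of the outage. 11:54 - May 4, 2019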
1. history
https://azure.microsoft.com/en-us/status/history/
More details: This incident resulted from the coincidence of two separate errors.
- Either error by itself would have been non-impacting:
1) Microsoft engineers executed a name server delegation change to update one name server for several Microsoft zones including Azure Storage and Azure SQL Database. Each of these zones has four name servers for redundancy, and the update was made to only one name server during this maintenance.
- A misconfiguration in the parameters of the automation being used to make the change resulted in an incorrect delegation for the name server under maintenance.
2) As an artifact of automation from prior maintenance, empty zone files existed on servers that were not the intended target of the assigned delegation. This by itself was not a problem as these name servers were not serving the zones in question.
Due to the configuration error in change automation in this instance, the name server delegation made during the maintenance targeted a name server that had an empty copy of the zones.
- As a result, this name server replied with negative (nxdomain) answers to all queries in the zones.
Since only one out of the four name server's records for the zones was incorrect, approximately one in four queries for the impacted zones would have received an incorrect negative response.
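The "one in four" figure can be checked with a small simulation. This is only a sketch under a simplifying assumption: each query goes to one of the four name servers chosen uniformly at random. Real resolvers often prefer servers by round-trip time, so the observed rate for any given resolver could differ. The server labels `ns1` to `ns4` are hypothetical and do not come from the RCA.

```python
import random

# Hypothetical labels for the four name servers of one zone.
NAME_SERVERS = ["ns1", "ns2", "ns3", "ns4"]
# The one server whose delegation pointed at an empty copy of the zone.
BROKEN = {"ns4"}


def query(rng: random.Random) -> str:
    """Send one query to a uniformly chosen name server (an assumption)."""
    server = rng.choice(NAME_SERVERS)
    # A server holding an empty zone answers NXDOMAIN for every name.
    return "NXDOMAIN" if server in BROKEN else "NOERROR"


def failure_rate(n_queries: int, seed: int = 0) -> float:
    """Fraction of simulated queries that received NXDOMAIN."""
    rng = random.Random(seed)
    failures = sum(query(rng) == "NXDOMAIN" for _ in range(n_queries))
    return failures / n_queries


# Exact expectation under the uniform-selection assumption.
expected = len(BROKEN) / len(NAME_SERVERS)

if __name__ == "__main__":
    print(f"expected failure rate:  {expected:.2%}")
    print(f"simulated failure rate: {failure_rate(100_000):.2%}")
```

The simulated rate should land close to the expected value, matching the roughly 25% of failed queries mentioned in the tweet above.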
DNS resolvers may cache negative responses for some period of time (negative caching), so even though erroneous configuration was promptly fixed, customers continued to be impacted by this change for varying lengths of time.
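Why the impact outlasted the fix can be illustrated with a toy negative cache. The sketch below follows the rule in RFC 2308 that a negative answer is cached for the lesser of the SOA record's own TTL and its MINIMUM field. The class names, the queried name, and the SOA values are all hypothetical; the RCA does not say what values the misconfigured server returned, and real resolvers may additionally cap the negative TTL.

```python
from dataclasses import dataclass


@dataclass
class SoaRecord:
    ttl: int      # TTL of the SOA record itself, in seconds
    minimum: int  # SOA MINIMUM field, in seconds


def negative_ttl(soa: SoaRecord) -> int:
    """Negative-caching TTL per RFC 2308: min(SOA TTL, SOA MINIMUM)."""
    return min(soa.ttl, soa.minimum)


class NegativeCache:
    """Toy resolver-side cache of NXDOMAIN answers (illustrative only)."""

    def __init__(self) -> None:
        self._expires: dict[str, float] = {}

    def store(self, name: str, soa: SoaRecord, now: float) -> None:
        # Remember the NXDOMAIN until the negative TTL runs out.
        self._expires[name] = now + negative_ttl(soa)

    def is_cached_negative(self, name: str, now: float) -> bool:
        expiry = self._expires.get(name)
        return expiry is not None and now < expiry


if __name__ == "__main__":
    soa = SoaRecord(ttl=3600, minimum=300)  # hypothetical values
    cache = NegativeCache()
    name = "example.database.windows.net"   # hypothetical name
    cache.store(name, soa, now=0.0)
    # Even if the server is fixed at t=10, the resolver keeps answering
    # NXDOMAIN from its cache until the negative TTL has elapsed.
    for t in (10.0, 299.0, 300.0):
        print(t, cache.is_cached_negative(name, now=t))
```

Because each resolver cached the bad answer at a different moment and may apply its own limits, recovery times would differ from customer to customer, consistent with the "varying lengths of time" in the RCA.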