Thursday, April 21, 2011

Thoughts on the Amazon Cloud Outage

So I learned about the Amazon Cloud outage when I got to the office this morning. What's down is EC2 (EBS) and RDS in US-East, and the production DB (RDS) of a system I'm involved with is caught up in it. On top of that, the DB for the Redmine we use to track the team's work also lives on EBS, so that's down too.

Below are Amazon's announcements as of 9 AM Pacific time.

EC2

1:41 AM PDT We are currently investigating latency and error rates with EBS volumes and connectivity issues reaching EC2 instances in the US-EAST-1 region. 
2:18 AM PDT We can confirm connectivity errors impacting EC2 instances and increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region. Increased error rates are affecting EBS CreateVolume API calls. We continue to work towards resolution. 
2:49 AM PDT We are continuing to see connectivity errors impacting EC2 instances, increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region, and increased error rates affecting EBS CreateVolume API calls. We are also experiencing delayed launches for EBS backed EC2 instances in affected availability zones in the US-EAST-1 region. We continue to work towards resolution. 
3:20 AM PDT Delayed EC2 instance launches and EBS API error rates are recovering. We're continuing to work towards full resolution. 
4:09 AM PDT EBS volume latency and API errors have recovered in one of the two impacted Availability Zones in US-EAST-1. We are continuing to work to resolve the issues in the second impacted Availability Zone. The errors, which started at 12:55AM PDT, began recovering at 2:55am PDT 
5:02 AM PDT Latency has recovered for a portion of the impacted EBS volumes. We are continuing to work to resolve the remaining issues with EBS volume latency and error rates in a single Availability Zone. 
6:09 AM PDT EBS API errors and volume latencies in the affected availability zone remain. We are continuing to work towards resolution. 
6:59 AM PDT There has been a moderate increase in error rates for CreateVolume. This may impact the launch of new EBS-backed EC2 instances in multiple availability zones in the US-EAST-1 region. Launches of instance store AMIs are currently unaffected. We are continuing to work on resolving this issue. 
7:40 AM PDT In addition to the EBS volume latencies, EBS-backed instances in the US-EAST-1 region are failing at a high rate. This is due to a high error rate for creating new volumes in this region.
RDS
 1:48 AM PDT We are currently investigating connectivity and latency issues with RDS database instances in the US-EAST-1 region. 
2:16 AM PDT We can confirm connectivity issues impacting RDS database instances across multiple availability zones in the US-EAST-1 region. 
3:05 AM PDT We are continuing to see connectivity issues impacting some RDS database instances in multiple availability zones in the US-EAST-1 region. Some Multi AZ failovers are taking longer than expected. We continue to work towards resolution. 
4:03 AM PDT We are making progress on failovers for Multi AZ instances and restore access to them. This event is also impacting RDS instance creation times in a single Availability Zone. We continue to work towards the resolution. 
5:06 AM PDT IO latency issues have recovered in one of the two impacted Availability Zones in US-EAST-1. We continue to make progress on restoring access and resolving IO latency issues for remaining affected RDS database instances. 
6:29 AM PDT We continue to work on restoring access to the affected Multi AZ instances and resolving the IO latency issues impacting RDS instances in the single availability zone. 
8:12 AM PDT Despite the continued effort from the team to resolve the issue we have not made any meaningful progress for the affected database instances since the last update. Create and Restore requests for RDS database instances are not succeeding in US-EAST-1 region.
The folks at Amazon are working hard on it, but so far there's no ETA for recovery. Multi-AZ failovers were also reported to be taking longer than expected (though it seems they did eventually complete), so even people who had prepared with redundancy were probably affected to some degree.

Whenever an incident like this happens, people will come out saying "see, public clouds are dangerous," but I actually expect this accident to prompt all sorts of proposals for "cheaper, more reliable ways to fail over to a standby." In short: what to do when EBS dies.

Maybe the way to go is to push EBS snapshots to S3 periodically, and when EBS dies and an AZ failover isn't possible either, spin up a big EC2 instance and restore the volume there. What we use is only about 10 GB, so in an emergency we should be able to manage.
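
As a note to self, that recovery path comes out roughly like the sketch below, written with boto3, the current AWS SDK for Python, purely for illustration (back then it would have been boto). The volume ID, instance ID, and target AZ are placeholders, not our real setup:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Take a periodic snapshot of the data volume; EBS snapshots are
    # persisted to S3 behind the scenes, so they survive the volume.
    snap = ec2.create_snapshot(VolumeId="vol-0123456789abcdef0",  # placeholder
                               Description="periodic backup")
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    # If the home AZ is impaired, recreate the volume from the snapshot
    # in a healthy AZ and attach it to an instance launched there.
    vol = ec2.create_volume(SnapshotId=snap["SnapshotId"],
                            AvailabilityZone="us-east-1b")  # healthy AZ
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
    ec2.attach_volume(VolumeId=vol["VolumeId"],
                      InstanceId="i-0123456789abcdef0",  # placeholder
                      Device="/dev/sdf")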

Added at 09:54
8:54 AM PDT We'd like to provide additional color on what we're working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.
It's still preliminary, but they've added information on the cause of the outage:
  • A network event triggered a flood of large-scale re-mirroring of EBS volumes.
  • Capacity ran short in one of the US-East AZs; as a result, the pace of volume creation on EBS dropped.
  • Part of EBS's internal control plane was swamped with requests.
Right now the folks at Amazon are apparently hard at work adding EBS capacity. Nothing to do but wait, but hang in there, Amazon folks!

Added at 15:00
2:35 PM PDT We have restored access to the majority of RDS Multi AZ instances and continue to work on the remaining affected instances. A single Availability Zone in the US-EAST-1 region continues to experience problems for launching new RDS database instances. All other Availability Zones are operating normally. Customers with snapshots/backups of their instances in the affected Availability zone can restore them into another zone. We recommend that customers do not target a specific Availability Zone when creating or restoring new RDS database instances. We have updated our service to avoid placing any RDS instances in the impaired zone for untargeted requests.
Our RDS instance is still dead, but the snapshots became visible again, so I spun up a new RDS instance from one of them and restored service. Being down for a full 14 hours is embarrassing, frankly. So from now on I'll keep snapshots on S3 in addition to EBS.
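
For the record, the recovery boils down to something like the boto3 sketch below (identifiers are made up, this isn't our actual script). Leaving the Availability Zone unset lets AWS place the new instance away from the impaired zone, as Amazon's update above recommends:

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # Spin up a replacement instance from the last good snapshot; no
    # AvailabilityZone is specified, so AWS avoids the impaired zone.
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier="mydb-recovered",             # placeholder name
        DBSnapshotIdentifier="rds:mydb-2011-04-21-00-00",  # placeholder snapshot
    )
    rds.get_waiter("db_instance_available").wait(
        DBInstanceIdentifier="mydb-recovered")

    # Then point the application's DB endpoint at the new instance.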

Added at 20:50

The EBS/RDS outage is still ongoing. Places like Hootsuite must be having a rough time.

As mentioned above, we now take dumps of the RDS data onto both EBS and S3, so I can finally sleep easy.
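
The dump job itself is nothing fancy; one plausible shape for it, run from cron, is below (RDS won't let you download its native snapshots, so a logical mysqldump stands in; the host, user, paths, and bucket name are all placeholders, and the password is assumed to come from ~/.my.cnf):

    import subprocess
    import boto3

    # Dump the DB to a file on an EBS volume, then push the same dump to
    # S3 so a dead EBS volume doesn't take the backup down with it.
    dump_file = "/mnt/backup/mydb.sql"
    with open(dump_file, "wb") as out:
        subprocess.run(
            ["mysqldump", "-h", "mydb.example.us-east-1.rds.amazonaws.com",
             "-u", "backup", "--single-transaction", "mydb"],
            stdout=out, check=True)

    boto3.client("s3").upload_file(dump_file, "my-backup-bucket", "rds/mydb.sql")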
