How to sort out a health check failure of a deployment on ECS

Ats · Jul 6, 2024

This is a note about what I did to deal with a health check failure of a deployment task on ECS.


Background

I normally use GitHub Actions for CI/CD and work on a service running on ECS. Starting 3~4 months ago, the CD pipeline began failing to deploy from time to time, and the GitHub Actions console showed an error like the one below.

But the ECS console said the deployment was still in progress, and it eventually finished after taking 30~45 minutes. Obviously, it was a problem, but I deprioritized it because the deployment would complete eventually. However, the time got longer bit by bit. Also, a new developer finally joined the team, so I decided to fix the problem and let him deploy whenever he wanted.

What I did

First of all, I checked the ECS console and the deployment status. The status looked like the one below.
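If you prefer a script to clicking through the console, the same deployment status can be pulled with boto3. Here is a minimal sketch; the cluster and service names ("my-cluster", "my-service") are placeholders:

```python
import boto3

ecs = boto3.client("ecs")

# Placeholder names: replace with your own cluster and service.
resp = ecs.describe_services(cluster="my-cluster", services=["my-service"])
service = resp["services"][0]

# Each deployment reports its state and how many tasks are desired,
# running, and failed, which is what the console table shows.
for d in service["deployments"]:
    print(d["status"], d.get("rolloutState"),
          "failed:", d.get("failedTasks", 0),
          "desired:", d["desiredCount"],
          "running:", d["runningCount"])
```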

There were failed tasks on ECS, and there was still an ongoing deployment even though GitHub Actions had failed. That’s why the deployment completed eventually. Then I looked for the reason the tasks failed and found it in the logs.
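The stopped tasks and their stop reasons can also be listed from a script. This is a sketch with the same placeholder names as above:

```python
import boto3

ecs = boto3.client("ecs")

# List recently stopped tasks for the service and print why each one stopped.
stopped = ecs.list_tasks(cluster="my-cluster", serviceName="my-service",
                         desiredStatus="STOPPED")
if stopped["taskArns"]:
    tasks = ecs.describe_tasks(cluster="my-cluster",
                               tasks=stopped["taskArns"])
    for t in tasks["tasks"]:
        # For a health check failure this typically reads something like
        # "Task failed ELB health checks in (target-group ...)".
        print(t["taskArn"], "->", t.get("stoppedReason"))
```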

According to the logs, my deployment task had become unhealthy and was replaced with a new one because of it. Then I looked into why it was in that status: I found the log entry where the service became unhealthy and opened the corresponding target group.

There I found my health check settings.
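For reference, the same health check settings can be read from the target group with boto3. This is just a sketch; the target group ARN is a placeholder:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN: take it from the service's load balancer configuration.
tg_arn = "arn:aws:elasticloadbalancing:...:targetgroup/my-tg/..."

tg = elbv2.describe_target_groups(TargetGroupArns=[tg_arn])["TargetGroups"][0]
print(tg["HealthCheckPath"])             # "/" in my case
print(tg["UnhealthyThresholdCount"])     # 3 in my case
print(tg["HealthCheckIntervalSeconds"])  # how often each check runs
print(tg["HealthCheckTimeoutSeconds"])
```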

So, based on those settings, my service failed because the server didn’t respond to requests to the / path 3 times in a row. I also checked my application server’s logs on the failed server, which I could open from the stopped task; it took me to CloudWatch. The logs looked like the ones below.
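Those logs can also be pulled from CloudWatch Logs with a small script. This sketch assumes the common awslogs setup where the log group is named after the task family; both names are placeholders:

```python
import boto3

logs = boto3.client("logs")

# Placeholder names: with the awslogs driver, the group is usually
# /ecs/<task-family> with one stream per container.
events = logs.filter_log_events(
    logGroupName="/ecs/my-task",
    logStreamNamePrefix="my-container/",
    limit=50,
)
for e in events["events"]:
    # Compare the first event's timestamp with the line where the server
    # starts accepting requests to estimate the boot-up time.
    print(e["timestamp"], e["message"])
```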

From the logs, my application server took about 2 minutes to start receiving requests. So I guessed that was why my server failed the health check, kept failing it, and was replaced with another one. It also explained why the CD run on GitHub Actions sometimes completed and sometimes failed: the boot-up time was sometimes shorter than 90 seconds and sometimes longer, I assume. I checked a few failed servers’ logs, and the average boot-up time was about 90 seconds. So, for this time, I set the health check grace period so the boot-up time is ignored, like below.
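(Three consecutive failures at the ELB default 30-second check interval add up to roughly 90 seconds, which may be why runs straddled that line; my actual interval isn’t shown here, so that part is an assumption.) I made the change in the console, but the equivalent can be done with boto3. The 180-second value below is just an example sized to comfortably cover the ~90-second boot-up; pick your own based on your logs:

```python
import boto3

ecs = boto3.client("ecs")

# Give new tasks time to boot before ELB health check failures count
# against them. 180 is an example value, not the one from my console.
ecs.update_service(
    cluster="my-cluster",
    service="my-service",
    healthCheckGracePeriodSeconds=180,
)
```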

Fundamentally, I need to investigate why the boot-up takes that long, because 90 seconds seems a bit long for just booting up. I think a CPU shortage could be the reason, though I haven’t looked into it carefully, so the real fix might be to scale up the EC2 instance. However, I stopped working on it after adjusting the grace period because of my priorities. I hope I can fix it properly soon.
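When I get back to it, a first step could be checking the service’s CPU utilization around deployments. A sketch with CloudWatch metrics, again with placeholder names:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Pull the service's CPU utilization for the last hour to see whether the
# slow boot-up correlates with the instance running hot during deploys.
now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "my-cluster"},
        {"Name": "ServiceName", "Value": "my-service"},
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Average", "Maximum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```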

That’s it!


Written by Ats

I like building something tangible like touch, gesture, and voice. Ruby on Rails / React Native / Yocto / Raspberry Pi / Interaction Design / CIID IDP alumni
