Post #192 by WatchingTheHerd on the Shrewdsmith board

No. of Recommendations: 10

Hi Manlobbi:

On 4/18/2024 at 4:00pm EDT, I noticed a problem which seems to recur not infrequently on the site. The site was showing similar symptoms earlier in the day as well.

Clicks to view pages were taking an extremely long time to load, so long that I sat down with a stopwatch to time them. Pages were consistently taking 3 minutes -- exactly -- to load. Then at about 4:09pm one page took 2 minutes 14 seconds to display a lower-level web server error stating that a downstream process was unavailable. Every click since then has been back to the normal BLAZING fast.

Based on prior experience with such things, the symptoms above seem to point to:

* one back-end server behind a load balancer locking up due to resource exhaustion (lack of DB connections, etc.)
* the load balancer is failing to detect that lockup and continuing to deliver traffic to it
* an upstream client or the client in the browser is using a default connection open timeout or response timeout limit of 3 minutes so it is WAITING that time before retrying or returning a failure

It doesn't appear that the root cause is due to underlying database horsepower, tuning or indexing because the system is normally BLAZING fast and reverts to BLAZING fast after some other gremlin clears. That would suggest a few things worth checking, depending upon the platforms used in the back-end of the system.

* core web service leaking open connections to the DB, eventually running out?
* timeout limit of the load balancer in front of that DB web service set at default value or set too high?
* health check on individual servers of the DB web service not configured or configured to only test the ability to open a connection rather than get a successful HTTP response?
* same type of problem with outer web services called by the browser client which then call these internal web services?

Different languages have different default behavior when wrapping database queries. Many libraries claim that they automatically release DB connections back to a pool, but that handling may not always work on some exception paths. The result can be a slow leak of open connections, leading to a web service instance which can ACCEPT work but cannot COMPLETE it because it has run out of DB connections.
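
A minimal Python sketch of the leak pattern described above. Everything here is illustrative: TinyPool, leaky_query, safe_query and count_leaks are made-up names, and sqlite3 only stands in for whatever database the site really uses. The point is simply that a release call placed after the query is skipped when the query throws, while a try/finally (here wrapped in a context manager) returns the connection either way.

```python
import queue
import sqlite3
from contextlib import contextmanager


class TinyPool:
    """Hypothetical fixed-size pool; a real service would use its driver's pool."""

    def __init__(self, size):
        self._free = queue.Queue()
        for _ in range(size):
            self._free.put(sqlite3.connect(":memory:"))

    def acquire(self, timeout=0.1):
        # Raises queue.Empty once every connection is checked out.
        return self._free.get(timeout=timeout)

    def release(self, conn):
        self._free.put(conn)

    def available(self):
        return self._free.qsize()

    @contextmanager
    def connection(self):
        conn = self.acquire()
        try:
            yield conn
        finally:
            # Runs on success AND on exception, so the connection always returns.
            self.release(conn)


def leaky_query(pool, sql):
    conn = pool.acquire()
    rows = conn.execute(sql).fetchall()  # an exception here skips the release below
    pool.release(conn)
    return rows


def safe_query(pool, sql):
    with pool.connection() as conn:
        return conn.execute(sql).fetchall()


def count_leaks(query_fn, pool_size=3, bad_requests=3):
    """Run failing queries, then report how many connections are still available."""
    pool = TinyPool(pool_size)
    for _ in range(bad_requests):
        try:
            query_fn(pool, "SELECT * FROM no_such_table")
        except sqlite3.OperationalError:
            pass  # the request fails either way; what matters is the connection
    return pool.available()
```

With a pool of 3 and three failing requests, count_leaks(leaky_query) leaves 0 connections available, while count_leaks(safe_query) leaves all 3. After that point the leaky instance can still accept requests but every acquire() fails.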

Lowering socket timeout settings is a quick way to flow traffic around a zombie node, but many libraries default to a timeout of 3 or even 6 minutes (or no timeout at all), which is WAY too long for a service handling interactive traffic where users expect responses in 5-10 seconds.
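
To make the timeout point concrete, here is a small Python sketch using only the standard socket module. The "zombie" is a listening socket that completes the TCP handshake but never reads or answers, which is my assumption about how a locked-up node looks from the outside. start_zombie_listener and fetch_with_timeout are hypothetical names, and the timeout values are just examples.

```python
import socket
import time


def start_zombie_listener():
    """Listening socket that completes TCP handshakes but never reads or replies."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
    srv.listen(5)
    return srv


def fetch_with_timeout(host, port, timeout_s):
    """Return (outcome, elapsed_seconds) for one request against host:port."""
    started = time.monotonic()
    try:
        # The timeout applies to the connect and to each later send/recv call.
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.sendall(b"GET / HTTP/1.0\r\n\r\n")
            sock.recv(1024)  # blocks until data arrives or the timeout fires
        outcome = "response"
    except socket.timeout:
        outcome = "timed out"
    return outcome, time.monotonic() - started
```

Calling fetch_with_timeout against the zombie with timeout_s=5 gives up after about 5 seconds, at which point the caller can retry against another node. With timeout_s=180 the same call would sit there for the full 3 minutes, which matches what I saw with the stopwatch.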

The health check referenced above could be the health check performed by the load balancer to verify that a member of the pool should continue receiving traffic, or it could be the health check performed by your hosting platform (Kubernetes?) to determine whether a worker virtual machine is healthy or should be killed and replaced with a new instance.
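
The difference between a connection-only check and a real HTTP health check can be sketched in Python as below. This is only an illustration under my own assumptions: the /healthz path, make_health_server, probe, tcp_probe and the db_check callback are all hypothetical, and a real deployment would configure the equivalent in the load balancer or hosting platform rather than in application code like this.

```python
import http.client
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_health_server(db_check):
    """Start a local server that answers 200 only if db_check() succeeds."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):  # answers every path the same way, for brevity
            try:
                db_check()  # e.g. borrow a pooled connection and run SELECT 1
                status, body = 200, b"ok"
            except Exception:
                status, body = 503, b"dependency unavailable"
            self.send_response(status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the example quiet

    server = HTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def tcp_probe(host, port, timeout_s=2.0):
    """Shallow check: passes as soon as a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def probe(host, port, path="/healthz", timeout_s=2.0):
    """Deeper check: healthy only on a timely HTTP 200 response."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout_s)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()
```

When db_check starts raising (say, because the pool is exhausted), tcp_probe still reports the node as healthy while probe reports it as unhealthy. That gap is exactly how a zombie node can keep receiving traffic.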

WTH
