Thanks -
I'll try to get a debug version up to check things out. The unfortunate part is that if I touch one of the existing repros, it immediately (but only temporarily) resolves itself. I made a change to my config to add another mapping to a test service, and just doing the "nginx -s reload" made the repro go away. I've now reset all repros but one, and I want to leave that one alone until the others start reproing again.
I'd like to continue tracking this down while I wait for a repro on a debug instance...
Regarding DNS caching: from what I've found searching around, it appears NGINX is already doing some sort of caching by default? Either way, my pattern happens multiple times per second if I cancel the hanging calls, i.e., curl it - success, curl it - hang (cancel in under half a second), curl it - success, curl it - hang... That seems awfully quick to be just a cache timing issue. Additionally, if I leave one of the hanging calls going and start curling the same endpoint from somewhere else, all of those other calls succeed for the duration of that hang. Does that still fit the DNS theory? I'm just not knowledgeable enough there to know.
I will try IP-based routing as a debug step, but the service we currently front (as well as many to come) uses AWS's Elastic Load Balancer as its entry point and has rolling dynamic IPs as it scales/swaps out instances to meet demand. For production, we'll need to stay with domain-name-based proxy_pass definitions.
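One thing I'm considering as a debug step is forcing NGINX to re-resolve the ELB hostname at runtime by adding a resolver and putting the hostname in a variable for proxy_pass. This is just a rough sketch of what I have in mind; the resolver address and hostname below are placeholders for my actual setup:

    # placeholder: the VPC DNS server for this subnet
    resolver 10.0.0.2 valid=10s;

    location /receipts {
        # placeholder for the real ELB hostname
        set $backend "my-elb.example.com";
        # using a variable makes nginx resolve the name per request via the resolver
        proxy_pass http://$backend;
    }

My understanding is that with a plain hostname in proxy_pass, NGINX resolves it once at startup/reload and caches the result until the next reload, which might also explain why a reload temporarily "fixes" the repro.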
I did find the following log entries (DNS name and IP addresses modified):
2015/12/15 03:40:13 [warn] 10#0: *32460 upstream server temporarily disabled while connecting to upstream, client: 10.0.1.127, server: , request: "HEAD /receipts HTTP/1.1", upstream: "http://52.52.52.52:80/", host: "www.mydomain.com"
2015/12/15 03:55:16 [error] 10#0: *32520 connect() failed (111: Connection refused) while connecting to upstream, client: 10.0.1.127, server: , request: "GET /v4/fca509b424b04cbf8f58cca76faba5b0 HTTP/1.1", upstream: "http://52.52.52.52:80/v4/fca509b424b04cbf8f58cca76faba5b0", host: "www.mydomain.com"
Beyond that, there was nothing of interest in error.log or access.log. The only other thing of note was that in access.log, the entries for a given request only showed up once the call completed (so in the 60-second delay case, the entry only appears after 60 seconds, when the response finally comes through).
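If it would help, I could also add a custom log_format with timing variables so that when the delayed request finally logs, I can see where the time went (connect vs. response). A rough sketch, assuming a recent enough nginx for $upstream_connect_time; the format name is just a placeholder:

    # timing breakdown: total request time, upstream connect time, upstream response time, upstream address
    log_format upstream_timing '$remote_addr [$time_local] "$request" $status '
                               'req=$request_time conn=$upstream_connect_time '
                               'resp=$upstream_response_time upstream=$upstream_addr';
    access_log /var/log/nginx/access.log upstream_timing;

That should at least tell me whether the 60 seconds is spent connecting to the upstream or waiting on its response.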
Thanks again for the help!