Oh and I missed the obvious one, the huge spike in user CPU at the same time.
Anyway, this dashboard is locked at a timeframe that shows the spike, the outage, and the recovery for anyone wanting to see:
https://app.datadoghq.com/dash/host/14232198117?refresh_mode=paused&view...
On 1/30/24 21:58, Phil Dibowitz wrote:
Er, you're off by two hours: it was 8:10am - 9:39am PT, best I can tell.
I just got home from work, but I did some digging around.
I don't think it's apache using too much memory. I don't think it's memory related at all.
Here's useful things I've found:
- Load average SPIKED a little while before the outage at around 7:52.
It came back down before 8 though
- We had a spike in network traffic at exactly the same time. Only
7MiB/s, but way bigger than our usual traffic.
- There was a spike in memory usage at the same time, but the system was
nowhere near out of memory.
The last datapoint DD got from Apache was at 7:54
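One way to pin down the window in which Apache stopped reporting is to look for gaps between consecutive datapoints. This is only a rough sketch: the `find_gaps` helper, the two-minute threshold, and the sample timestamps are all illustrative assumptions, not data exported from Datadog.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, max_interval=timedelta(minutes=2)):
    """Return (last_seen, next_seen) pairs where consecutive datapoints
    are further apart than max_interval."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_interval]


# Illustrative sample only: datapoints every minute until 7:54, then
# nothing until reporting resumes. The resume time here is made up.
day = datetime(2024, 1, 30)
sample = [day.replace(hour=7, minute=m) for m in (50, 51, 52, 53, 54)]
sample += [day.replace(hour=9, minute=m) for m in (40, 41)]

for last_seen, next_seen in find_gaps(sample):
    print(f"no datapoints between {last_seen:%H:%M} and {next_seen:%H:%M}")
```

Running the same check against each metric source would show whether only the Apache integration went quiet or the whole host did.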
Odd thing that I don't think is related but is funny. The amount of
memory journald is using drops significantly when apache gets killed, from 215MB to 160MB. At the beginning of the outage it was ~188MB. I just restarted it and it's at 22MB. It was the top user of RSS memory on the box until I restarted it.
Based on the fact that every OTHER process on the system was reporting into DD, but apache was not, it seems pretty likely that every thread in apache is tied up on _something_ - either talking to a DB, a bad client (seems unlikely since we have Cloudflare in front of us), or something else.
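A quick way to test the "every thread is tied up" theory is to count worker states from the mod_status scoreboard. The sketch below assumes the status data is in the machine-readable form that `server-status?auto` produces (I have not checked what format our apache_status.log actually uses), and the sample string is made up for illustration. The function names are hypothetical helpers.

```python
from collections import Counter

# Scoreboard key as documented for Apache mod_status.
SCOREBOARD_KEY = {
    "_": "waiting for connection",
    "S": "starting up",
    "R": "reading request",
    "W": "sending reply",
    "K": "keepalive (read)",
    "D": "DNS lookup",
    "C": "closing connection",
    "L": "logging",
    "G": "gracefully finishing",
    "I": "idle cleanup of worker",
    ".": "open slot with no current process",
}


def summarize_scoreboard(status_text):
    """Count worker states from the 'Scoreboard:' line of ?auto output."""
    for line in status_text.splitlines():
        if line.startswith("Scoreboard:"):
            board = line.split(":", 1)[1].strip()
            return Counter(SCOREBOARD_KEY.get(ch, f"unknown ({ch})") for ch in board)
    return Counter()


def busy_fraction(counts):
    """Fraction of live workers (open slots excluded) that are not idle."""
    idle = counts["waiting for connection"]
    live = sum(n for state, n in counts.items()
               if state != "open slot with no current process")
    return (live - idle) / live if live else 0.0


# Made-up sample, not taken from our logs.
sample = "BusyWorkers: 7\nIdleWorkers: 1\nScoreboard: WWWWRKW_...."
counts = summarize_scoreboard(sample)
print(counts.most_common())
print(f"busy fraction: {busy_fraction(counts):.2f}")
```

If the snapshots just before the outage show nearly every worker in "sending reply" or "reading request" with no idle workers left, that would support the stuck-threads theory over a memory problem.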
That's all I got for now.
On 1/30/24 12:04, Ilan Rabinovitch wrote:
Looks like we had another outage for about an hour today. 10am - 11:36am PT.
On Mon, Jan 8, 2024 at 9:53 AM Phil Dibowitz <phil@ipom.com> wrote:
No, I woke up, saw the alert, bounced httpd, and had to run out the door.
It's a reasonable guess that it's the same memory issue as before. We're currently bouncing apache in cron every-other-hour, which seems to mostly keep us up, but occasionally not, so my guess is that's RIGHT at the threshold.
I don't have the bandwidth to go spelunking - the httpd.conf adjustments I made in... November or whatever that was seemed to help, but obviously they didn't solve everything. I still suspect that Drupal is causing connections to be held open somewhere, but that's pure speculation.
We have the data, between the apache_status.log and DD to figure it out, I'd just need time to do it, and I don't have the cycles right now.
On 1/7/24 17:11, Ilan Rabinovitch wrote:
> Looks like we had another outage today. Any idea whats up?
>
> On Mon, Dec 25, 2023 at 1:17 AM Ilan Rabinovitch <ilan@linuxfests.org> wrote:
>
> Looks like another outage about 2 hours ago.
>
> On Sun, Dec 24, 2023 at 8:45 PM Ilan Rabinovitch <ilan@linuxfests.org> wrote:
>
> website was down again.. looks like for about an hour. ive just
> restarted httpd.
>
> On Sat, Oct 21, 2023 at 5:00 PM Phil Dibowitz <phil@ipom.com> wrote:
>
> Logs look good.
>
> https://github.com/socallinuxexpo/scale-chef/pull/302 drops restarts to
> every 2 hours. Merged.
>
> On 10/21/23 13:47, Phil Dibowitz wrote:
> > I dropped it to less often a few days ago, I haven't yet looked at the
> > logs to see what the status of httpd is at those times. My plan is to do
> > that, then drop it to less often, and rinse and repeat.
> >
> > Sorry new job is keeping me super busy.
> >
> > On 10/21/23 08:25, Ilan Rabinovitch wrote:
> >> Are we able to remove the cron job that's restarting httpd?
> >>
> >> On Sun, Oct 8, 2023 at 12:48 AM Phil Dibowitz <phil@ipom.com> wrote:
> >>>
> >>> On 10/7/23 14:30, Phillip Smith wrote:
> >>>> Yes, let's schedule something. I'm available tomorrow morning and
> >>>> Tuesday evening.
> >>>
> >>> What can I provide you in the mean time? I thought I saw an email from
> >>> you saying you wanted Datadog access, but I can't seem to find it.
> >>>
> >>> Davide may have time Tuesday, I will have time Wed evening. I'm out of
> >>> town until Monday and have plans Mon/Tues evening.
> >>>
> >>> - Phil
> >>>
> >>> _______________________________________________
> >>> scale-infra mailing list
> >>> scale-infra@lists.linuxfests.org
> >>> https://lists.linuxfests.org/cgi-bin/mailman/listinfo/scale-infra
>
> --
> Phil Dibowitz                             phil@ipom.com
> Open Source software and tech docs        Insanity Palace of Metallica
> http://www.phildev.net/                   http://www.ipom.com/
>
> "Be who you are and say what you feel, because those who mind don't
> matter and those who matter don't mind."
>  - Dr. Seuss
--
Phil Dibowitz                             phil@ipom.com
Open Source software and tech docs        Insanity Palace of Metallica
http://www.phildev.net/                   http://www.ipom.com/

"Be who you are and say what you feel, because those who mind don't
matter and those who matter don't mind."
 - Dr. Seuss

_______________________________________________
scale-infra mailing list
scale-infra@lists.linuxfests.org
https://lists.linuxfests.org/cgi-bin/mailman/listinfo/scale-infra