Codecov and CyberShadow failure

Feb 08, 2017

RazvanN

Feb 08, 2017

Jack Stouffer

Feb 09, 2017

Feb 12, 2017

Feb 13, 2017

Feb 09, 2017

On Wednesday, 8 February 2017 at 17:30:53 UTC, RazvanN wrote:
> I've noticed a couple of days ago that the 2 components mentioned in $title aren't working when making PRs. I don't have any experience with this, so what is there to be done?
>
> RazvanN

Trying to narrow it down here: https://github.com/dlang/phobos/pull/5099

On Wednesday, 8 February 2017 at 17:30:53 UTC, RazvanN wrote:
> I've noticed a couple of days ago that the 2 components mentioned in $title aren't working when making PRs. I don't have any experience with this, so what is there to be done?
>
> RazvanN

CyberShadow is Vladimir Panteleev's nickname: https://github.com/cybershadow

DAutoTest is his automated documentation tester that isn't working.

On Wednesday, 8 February 2017 at 21:05:45 UTC, Jack Stouffer wrote:
> ...

Still can't find the root cause. I'm also unable to recreate the problem locally using the same commands as the doc builder.

We currently have nine PRs in the pipe ready to be merged once this error is nailed down. If anyone could lend a hand here, it would be very helpful.

February 12, 2017

Re: Codecov and CyberShadow failure

Posted by Vladimir Panteleev
in reply to Jack Stouffer


On Thursday, 9 February 2017 at 17:41:09 UTC, Jack Stouffer wrote:
> On Wednesday, 8 February 2017 at 21:05:45 UTC, Jack Stouffer wrote:
>> ...
>
> Still can't find the root cause. I'm also unable to recreate the problem locally using the same commands as the doc builder.
>
> We currently have nine PRs in the pipe ready to be merged once this error is nailed down. If anyone could lend a hand here, it would be very helpful.

Apologies for that. I made the documentation tester mandatory a while ago, so extended downtime like this is unacceptable.

In the interest of public disclosure, here is the timeline and problems encountered:

- In response to some complaints about forum performance, I investigated sources of high I/O on the server, and identified the documentation tester as a major culprit. On 2017-02-06, I moved the working directory to a tmpfs (/dev/shm), which resulted in a dramatic improvement of I/O operations: https://dump.thecybershadow.net/d41c095b6a0dcdb7b827499a487b7c65/16%3A42%3A10-upload.png

- I then began receiving reports of the autotester malfunctioning. While debugging this problem, I discovered a second problem: some files on the tmpfs would periodically disappear. This is what caused the intermittent "file not found" errors.

- After some trial and error, I identified the source of the second problem (an unusual systemd behaviour). I adjusted the server configuration on 2017-02-09 to disable the behaviour.

- However, the first problem, which manifested as compilation errors in the 2.073.0 version of Phobos, persisted. Finally, yesterday (2017-02-11), with some experimentation, I discovered that the root problem was a latent DMD bug which manifested only when the Phobos source files were passed to it in a certain order, which happened to be the file iteration order on tmpfs. Details in the pull request: https://github.com/dlang/dlang.org/pull/1568

- Now that the PR is merged, master and stable are green again.

I accept that this shouldn't have taken a week to fix, and the initial change in question (tmpfs move) would have been better done in a test environment. FWIW, in parallel I've been working on a full-disk backup strategy to prepare for having one of the server's HDDs replaced. (We already have backups of critical data, but rebuilding from backups and reinstalling the system would result in downtime that can be avoided. The HDDs are already in RAID1 configuration, so the full disk backup is a precaution.)
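For readers curious about the file-order issue in the timeline above, here is a minimal sketch of one general way to keep a build independent of directory iteration order: collect the source files and sort them before passing them to the compiler. This is an illustration only and not necessarily what the linked pull request does. The function names `collect_sources` and `build_command` and the example file names are hypothetical.

```python
import os
import tempfile


def collect_sources(root, suffix=".d"):
    """Return source files under root in a stable, sorted order.

    Directory iteration order differs between filesystems (for example
    tmpfs versus an on-disk filesystem), so the result is sorted explicitly.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order itself deterministic
        for name in filenames:
            if name.endswith(suffix):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def build_command(compiler, sources):
    """Assemble a compiler argument list (hypothetical invocation)."""
    return [compiler, "-c", *sources]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Create the files in a deliberately unsorted order.
        for name in ("zeta.d", "alpha.d", "mid.d", "notes.txt"):
            open(os.path.join(root, name), "w").close()
        sources = collect_sources(root)
        print(build_command("dmd", [os.path.basename(p) for p in sources]))
```

Sorting hides an order-dependent compiler bug rather than fixing it, but it does make a build reproducible across machines, which would have made a problem like this one easier to recreate locally.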

On Sunday, 12 February 2017 at 15:30:40 UTC, Vladimir Panteleev wrote:
> I accept that this shouldn't have taken a week to fix, and the initial change in question (tmpfs move) would have been better done in a test environment. FWIW, in parallel I've been working on a full-disk backup strategy to prepare for having one of the server's HDDs replaced. (We already have backups of critical data, but rebuilding from backups and reinstalling the system would result in downtime that can be avoided. The HDDs are already in RAID1 configuration, so the full disk backup is a precaution.)

I bet that wasn't easy to find; you must have been tearing your hair out while debugging. Also kudos for the disclosure.
