Closed Bug 796840 Opened 12 years ago Closed 12 years ago

Regular "The handle is invalid" errors trying to build with mach

Categories

(Firefox Build System :: Mach Core, enhancement)

x86_64
Windows 7
enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla18

People

(Reporter: markh, Unassigned)

Details

Attachments

(2 files)

I usually see mach builds fail with:

--
Parent process exited without                                                   killing children, attempting to kill children
Parent process exited without                                                   killing children, attempting to kill children
Error Code 6 trying to query IO Completion Port, exiting
Exception in thread Thread-3:
Traceback (most recent call last):
  File "c:\mozilla-build\python\lib\threading.py", line 552, in __bootstrap_inner
    self.run()
  File "c:\mozilla-build\python\lib\threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "o:\src\mm\mozilla-hg\mc-socialapi-landing\testing/mozbase/mozprocess\mozprocess\processhandler.py", line 282, in _procmgr
    self._poll_iocompletion_port()
  File "o:\src\mm\mozilla-hg\mc-socialapi-landing\testing/mozbase/mozprocess\mozprocess\processhandler.py", line 331, in _poll_iocompletion_port
    raise WinError(errcode)
WindowsError: [Error 6] The handle is invalid.

1643.46 2011 compiler warnings present.
--

Note the spaces in the first 2 lines really are there :).

The small amount of debugging I've done shows that the invalid handle error is because self._io_port is None, so something is getting a little confused.  Also, I see another potential foot-gun in that code - specifically:

                    portstatus = winprocess.GetQueuedCompletionStatus(self._io_port,
                                                                      byref(msgid),
                                                                      byref(compkey),
                                                                      byref(pid),
                                                                      5000)

                    # If the countdowntokill has been activated, we need to check
                    # if we should start killing the children or not.
                    if countdowntokill != 0:
                        ...

                    if not portstatus:
                        # Check to see what happened
                        errcode = winprocess.GetLastError()

The "if countdowntokill" block could easily/likely change the GetLastError() result, depending on exactly what the '...' block winds up doing.
I instrumented the code - just after the call to GetQueuedCompletionStatus I added:

+                    if not portstatus:
+                        print "no port status(1) - self._io_port is", self._io_port, "last error is", winprocess.GetLastError()

and just after if not portstatus:
+                        print "no port status(2) - self._io_port is now", self._io_port

And the tail of the output is:
---
no port status(1) - self._io_port is <AutoHANDLE object at 0x02AB8D50> last error is 258
no port status(2) - self._io_port is now <AutoHANDLE object at 0x02AB8D50>
no port status(1) - self._io_port is <AutoHANDLE object at 0x02AB8D50> last error is 258
Parent process exited without                                                   killing children, attempting to kill children
no port status(2) - self._io_port is now None
no port status(1) - self._io_port is None last error is 6
Parent process exited without                                                   killing children, attempting to kill children
no port status(2) - self._io_port is now None
Error Code 6 trying to query IO Completion Port, exiting
Exception in thread Thread-3:
Traceback (most recent call last):
  File "c:\mozilla-build\python\lib\threading.py", line 552, in __bootstrap_inner
    self.run()
  File "c:\mozilla-build\python\lib\threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "o:\src\mm\mozilla-hg\mc-socialapi-landing\testing/mozbase/mozprocess\mozprocess\processhandler.py", line 282, in _procmgr
    self._poll_iocompletion_port()
  File "o:\src\mm\mozilla-hg\mc-socialapi-landing\testing/mozbase/mozprocess\mozprocess\processhandler.py", line 334, in _poll_iocompletion_port
    raise WinError(errcode)
WindowsError: [Error 6] The handle is invalid.
---

Note the first 2 lines repeat for a few minutes with nothing happening.

So what is happening is:
* GetQueuedCompletionStatus is returning 258 each 5 seconds for a few minutes.
* The "if countdowntokill:" block gets entered, which does the "kill" thing, which sets self._io_port to None and somewhere causes GetLastError() to reflect ERROR_INVALID_HANDLE.
* We enter the "if not portstatus" block and see the misleading invalid handle message.

So ultimately, the root problem is real - for some reason the build always stops doing anything useful after ~320 seconds, then we attempt to kill the process, then print a somewhat misleading error about the handle being invalid.

If I repeat the build, it continues for normal until the log indicates we are ~320 seconds through, then we see things "stall" for a few minutes until the countdowntokill block gets entered, then we see that message.  This can be repeated until the build eventually succeeds.
Version: 16 Branch → Trunk
Via the patch in bug 796880, I can see the root cause seems to be the process mspdbsrv.exe.  A quick google shows it is expected to hang around a little longer than usual and we need not wait on it (maybe it detaches from the job?)

So in the meantime, I try and manually kill this process once we get into that state - and see something I don't understand.  The PID I kill is 9880, and the output I see (with MOZPROCESS_DEBUG set) is:
---
324.82
324.82 nsNSSVersion.cpp
324.82
[here we are in the "strange" state doing nothing, so I kill the PID]
DBG::MOZPROC process id 2556 exited normally
DBG::MOZPROC process id 7188 exited normally
DBG::MOZPROC process id 6572 exited normally
DBG::MOZPROC process id 9880 exited normally
DBG::MOZPROC job object msg active processes zero
369.95 2011 compiler warnings present.
---

Not sure what those other pids are, and the build has *not* finished - it has terminated and more needs to be done - but the output looks as though everything terminated normally and everything is done.

I'll keep playing and keep spamming this bug as I find more...
procexp tells me that mspdbsrv.exe is still part of the job, and it was spawned by a parent that has already died (frm what google tells me, that's likely to be the first cl.exe spawned).

I hacked the code so that when the parent died I checked if mspdbsrc.exe was the only child, and as expected, this behaved just like when I manually killed it - a message it has terminated "normally", then the build stopped even though there was more to build.

Although now I see another strange message:
---
335.58
DBG::MOZPROC process id 9032 exited normally
DBG::MOZPROC process id 7124 exited normally
DBG::MOZPROC process id 10232 exited normally
DBG::MOZPROC job object msg active processes zero
DBG::MOZPROC NOT USING JOB OBJECTS!!!
---
Not sure why I see that last line.  I'll attach my WIP - bed-time!
Attached file wip for reference
This is the same mozprocess error from bug 795670.
So far I have determined that pymake is terminating somewhat early - but it is terminating with an exit code of zero.  I instrumented pymake to dump when it terminates, and the output I see is:

331.37 PYMAKE TERMINATING (<type 'exceptions.SystemExit'>, SystemExit(0,), <traceback object at 0x029A3F80>)
331.37 Traceback (most recent call last):
331.37   File "o:/src/mm/mozilla-hg/mc-socialapi-landing/build/pymake/make.py", line 22, in <module>
331.37     pymake.process.ParallelContext.spin()
331.37   File "o:\src\mm\mozilla-hg\mc-socialapi-landing\build\pymake\pymake\process.py", line 377, in spin
331.37     c.run()
331.37   File "o:\src\mm\mozilla-hg\mc-socialapi-landing\build\pymake\pymake\process.py", line 307, in run
331.37     cb(*args, **kwargs)
331.37 SystemExit: 0
DBG::MOZPROC process id 5364 exited normally
331.42 nsCertOverrideService.cpp
331.42 
331.42 TransportSecurityInfo.cpp
331.42 
331.42 nsKeygenHandler.cpp
331.42 
331.43 nsNSSCertValidity.cpp
331.43 
331.43 nsStreamCipher.cpp
331.43 
331.43 nsRandomGenerator.cpp


Note that pymake has terminated, but the child jobs have not - ie, the compilation of the .cpp files is continuing - hence my comment it seems to have terminated "early".  I can only speculate that the use of job objects is somehow confusing things.
Interesting observation about pymake. I wonder if mozprocess and/or mach is not handling SIGINT properly.
There are a number of things here all conspiring against me :)

* a normal "make" in an up-to-date tree should be a no-op - instead it actually does alot of work, including building .cpp files etc.

* mach causes output to appear in a different order - eg, the last output lines are the .cpp files being build - not the link for those .cpp files nor other stuff that actually happens later.  Setting PYTHONUNBUFFERED in the environment fixes this problem - but it caused me to think the build was terminating early - it wasn't.

* Building with VS2010 causes an mspdbsrv.exe process to be created in the "background" (implicitly by the other ms tools), but in the normal case this doesn't matter - it just does its thing and terminates later.  However, under mach it gets added to the job, so at the end of the build mach sees this process still running and eventually decides it should terminate it.

* The termination code in mozprocess that tries to kill this process is a little broken - it ends up reporting the error originally described in this bug, and also still fails to terminate the mspdbsrv.exe process.  The latter isn't a problem, but the former is - it takes a few minutes to decide the child process should be terminated before reporting an obscure error about a handle being invalid.

I've found that the problem re mspdbsrv.exe is fixed if mozbuild/base.py arranges for _run_make to pass ignore_children=True to the process runner - which is possibly quite reasonable - make/pymake typically does the right thing with child processes, so mozprocess need not get involved - it just ends up processing thousands of iocp messages and still ends up getting confused by the mspdbsrv.exe process.  How to best handle the mis-ordering of the output beyond setting that environment variable still escapes me...
> * The termination code in mozprocess that tries to kill this process is a little broken - it ends up reporting the error originally described in this bug, and also still fails to terminate the mspdbsrv.exe process.  The latter isn't a problem, but the former is - it takes a few minutes to decide the child process should be terminated before reporting an obscure error about a handle being invalid.

Could you please ticket at Testing::Mozbase and CC :ahal and myself?
(In reply to Mark Hammond (:markh) from comment #8)
> * mach causes output to appear in a different order - eg, the last output
> lines are the .cpp files being build - not the link for those .cpp files nor
> other stuff that actually happens later.  Setting PYTHONUNBUFFERED in the
> environment fixes this problem - but it caused me to think the build was
> terminating early - it wasn't.

Wait, what? mach should not change anything about output order. Currently, mach literally does a |make -f client.mk| captures the output line by line and then prints it back to the terminal. Now, perhaps stdout is being buffered and mach should flush on every stdout.write(). But, order should not have changed. The only way order could have changed is if mozprocess isn't preserving order or somehow some other process is inheriting and writing to stdout. Most weird.

> * Building with VS2010 causes an mspdbsrv.exe process to be created in the
> "background" (implicitly by the other ms tools), but in the normal case this
> doesn't matter - it just does its thing and terminates later.  However,
> under mach it gets added to the job, so at the end of the build mach sees
> this process still running and eventually decides it should terminate it.

Oh, what's this ignore_children foo in mozprocess? Is it doing something above and beyond the normal POSIX semantics for detecting when a process has ended? (session leader, all that fun stuff.)

> I've found that the problem re mspdbsrv.exe is fixed if mozbuild/base.py
> arranges for _run_make to pass ignore_children=True to the process runner -
> which is possibly quite reasonable - make/pymake typically does the right
> thing with child processes, so mozprocess need not get involved - it just
> ends up processing thousands of iocp messages and still ends up getting
> confused by the mspdbsrv.exe process.  How to best handle the mis-ordering
> of the output beyond setting that environment variable still escapes me...

I agree with your assessment about make doing the right thing with regard to process management. Let me code up a patch to pass ignore_children=True to mozprocess for make...
I'm not entirely sure what all is involved with ignore_children. But, the reasoning that make manages child processes and thus whatever calls make doesn't need to bother seems sound to me.
Attachment #668065 - Flags: review?(jhammel)
If mach is connecting the output of make to a pipe, then there's definitely a difference between running "make -f client.mk" (even if that's pymake) and "mach build". In the former case the stdouts of child processes should get connected directly to the terminal. In the latter case they'll all be connected to a pipe, which means you get Windows' brain-dead buffering in the way (it buffers 4k at a time, not line-based).
(In reply to Ted Mielczarek [:ted] from comment #12)
> If mach is connecting the output of make to a pipe, then there's definitely
> a difference between running "make -f client.mk" (even if that's pymake) and
> "mach build". In the former case the stdouts of child processes should get
> connected directly to the terminal. In the latter case they'll all be
> connected to a pipe, which means you get Windows' brain-dead buffering in
> the way (it buffers 4k at a time, not line-based).

I thought this was something that mozprocess magically took care of. When you run builds through mach, build output is instantaneous AFAICT. jhammel?
I'm not the expert here; mozprocess is pretty complicated and sadly I don't understand much of the nuances though I hope to fix up some of the tests to make it slightly more understandable.  

AIUI, mozprocess folds stderr into stdout: https://bug796840.bugzilla.mozilla.org/attachment.cgi?id=668065 . I would *assume* that this effectively makes stderr+stdout one time-based buffered, um, buffer with strict ordering.  I don't even know this for a fact on unix, let alone windows.

Output reading is *not* instantaneous: see https://github.com/mozilla/mozbase/blob/master/mozprocess/mozprocess/processhandler.py#L661 and try not to get confused. We read output a chunk at a time.

I don't know if we've solved the aforementioned windows problem or not, TBH.  Perhaps :ahal or :ctalbert or :jgriffin can comment.  It would also be nice to have a test for this, but since we don't test on windows currently, its fairly pointless.
Attachment #668065 - Flags: review?(jhammel) → review+
(In reply to Ted Mielczarek [:ted] from comment #12)
> If mach is connecting the output of make to a pipe, then there's definitely
> a difference between running "make -f client.mk" (even if that's pymake) and
> "mach build". In the former case the stdouts of child processes should get
> connected directly to the terminal. In the latter case they'll all be
> connected to a pipe, which means you get Windows' brain-dead buffering in
> the way (it buffers 4k at a time, not line-based).

Yes, this is indeed the case.  The output from mach is the same as what we see if the output of pymake is redirected.  Eg:

1% python .../make.py  # output order "correct".
2% python .../make > foo 2>&1 # output order "incorrect."
3% PYTHONUNBUFFERED=1 python .../make.py > foo 2>&1 # order "correct"
4% ./mach build # output order same as 2 above.
5% ./mach build > foo 2>&1 # output order same as 2 above.

So maybe mach sticking PYTHONUNBUFFERED in the environment when pymake is used is the reasonable - probably that, or arranging for the first pymake to be executed with -u, and hacking pymake so all child pymakes also get the -u.
I /think/ that if all we care about is output buffering then doing sys.stdout.flush() after every write to stdout from mach is sufficient.
Ignore children patch pushed.

https://hg.mozilla.org/mozilla-central/rev/a119cc1cdf0a
Status: NEW → ASSIGNED
Whiteboard: [leave open]
Target Milestone: --- → mozilla18
(In reply to Gregory Szorc [:gps] from comment #16)
> I /think/ that if all we care about is output buffering then doing
> sys.stdout.flush() after every write to stdout from mach is sufficient.

I doubt that will be the case, given pymake can demonstrate the same issue.  ie, I think that by the time mach *reads* the process output the order is already wrong.
(In reply to Jeff Hammel [:jhammel] from comment #9)
> > * The termination code in mozprocess that tries to kill this process is a little broken - it ends up reporting the error originally described in this bug, and also still fails to terminate the mspdbsrv.exe process.  The latter isn't a problem, but the former is - it takes a few minutes to decide the child process should be terminated before reporting an obscure error about a handle being invalid.
> 
> Could you please ticket at Testing::Mozbase and CC :ahal and myself?

Done - bug 798185.

So the only other outstanding issue is that pymake, when run under mach, has the output in the wrong order.  I'd suggest we open a new bug for that, then close this one as invalid as it has got too noisy to be useful.
(In reply to Jeff Hammel [:jhammel] from comment #14)
> I don't know if we've solved the aforementioned windows problem or not, TBH.
> Perhaps :ahal or :ctalbert or :jgriffin can comment.  It would also be nice
> to have a test for this, but since we don't test on windows currently, its
> fairly pointless.

It hasn't. I actually thought about trying to fix this at one point, so I wrote a unit test that fails. I'll file a new bug and attach that.
The only remaining issue was interaction between PYTHONUNBUFFERED and stdout from mach. I looked at the logging handlers and we do call stdout.flush() when writing logs to the terminal. Both the built-in logging.StreamHandler does it as well as mach's custom mach.terminal.LoggingHandler.

Since other bugs have been filed, we can now resolve this one.

Thanks everyone for helping to track down the issues!
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Whiteboard: [leave open]
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: