mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joris Van Remoortere" <joris.van.remoort...@gmail.com>
Subject Re: Review Request 36061: Slave exits gracefully on DNS lookup failure.
Date Wed, 01 Jul 2015 17:19:42 GMT


> On July 1, 2015, 7:20 a.m., Adam B wrote:
> > I'm sorry, but I don't understand how LOG(FATAL) was segfaulting before. Please
explain.

Hi Adam,
This was reported as a segfault by a user in IRC. The specific segfault was fixed by MESOS-2636.
The segfault _was_ happening under an error condition, so after the segfault was fixed, we
would _still_ exit with a SIGABRT / stack trace that is generated by a `LOG(FATAL)`.

There has been a drive to reduce the number of bad user experiences by converting some of
the exit points that SIGABRT / stack trace when just a friendly message to the user would
be much more helpful.
I acknowledge that `PLOG(FATAL)` already prints out an error message; however, it is rather
hidden amongst the other messages and stack trace. Users tend to miss this and complain that
Mesos is broken, rather than fixing the user / configuration problem mentioned in the error
message.

Does this answer your question?


> On July 1, 2015, 7:20 a.m., Adam B wrote:
> > 3rdparty/libprocess/src/process.cpp, line 899
> > <https://reviews.apache.org/r/36061/diff/2/?file=996273#file996273line899>
> >
> >     I'm confused about how this was segfaulting before. "Logging a FATAL message
terminates the program (after the message is logged)", so unless `ip` is invalid or `ip.error()`
doesn't exist (yet `ip.isError()==true`), the process should exit cleanly after logging the
message.
> >     
> >     The original ticket (MESOS-2636) discusses segfaults within the actual `getIP()`
and `hostname()` methods, so the process would have segfaulted before reaching the LOG(FATAL).
Now that MESOS-2636 has been fixed, we should safely get back an `ip.error()` that we can
actually log.

Please see my top-level comment. This ticket is confusing because it's a mix of people running
old code and carrying the terminology into a rebased patch.
This ticket is about cleaning up the communication of the error message, not fixing the segfault.


> On July 1, 2015, 7:20 a.m., Adam B wrote:
> > 3rdparty/libprocess/src/process.cpp, lines 890-891
> > <https://reviews.apache.org/r/36061/diff/2/?file=996273#file996273line890>
> >
> >     Why doesn't this need to be an `EXIT_(EXIT_FAILURE)` as well?

We can consider that as well. The main difference is that `net::getIP()` is returning a `Try<net::IP>`,
whereas gethostname is still a raw system call.
In an ideal world we would be more sure that something that returns a `Try<T>` for a
function like this would only error out if there really was a useful error message for the
user and otherwise abort if there was a programmer error.


- Joris


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36061/#review90014
-----------------------------------------------------------


On July 1, 2015, 12:32 a.m., Marco Massenzio wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/36061/
> -----------------------------------------------------------
> 
> (Updated July 1, 2015, 12:32 a.m.)
> 
> 
> Review request for mesos, Adam B and Joris Van Remoortere.
> 
> 
> Bugs: MESOS-2962
>     https://issues.apache.org/jira/browse/MESOS-2962
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Jira: MESOS-2962
> 
> Slave fails with Abort stacktrace when DNS cannot resolve hostname
> 
> If the DNS cannot resolve the hostname for a slave node, we correctly return an Error
object, but we then fail with a segfault.
> 
> This code adds a more user-friendly message and exits normally (with an `EXIT_FAILURE`
code).
> For example, forcing `net::getIp()` to always return an Error, now causes the slave to
exit like this:
> ```
> $ ./bin/mesos-slave.sh --master=10.10.1.121:5405
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> E0630 11:31:45.777465 1944417024 process.cpp:899] Could not obtain the IP address for
stratos.local; the DNS service may not be able to resolve it: >>> Marco was here!!!
> 
> $ echo $?
> 1
> ```
> 
> 
> Diffs
> -----
> 
>   3rdparty/libprocess/src/process.cpp d99947c1598c43c47c88ef3e8038081855f0d1dc 
> 
> Diff: https://reviews.apache.org/r/36061/diff/
> 
> 
> Testing
> -------
> 
> make check
> and manual failing the DNS
> 
> 
> Thanks,
> 
> Marco Massenzio
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message