mesos-reviews mailing list archives

From Mesos Reviewbot Windows <revi...@mesos.apache.org>
Subject Re: Review Request 66644: WIP:Remove unknown unreachable tasks when agent re-registers.
Date Mon, 23 Apr 2018 20:58:10 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66644/#review201768
-----------------------------------------------------------



FAIL: Some of the unit tests failed. Please check the relevant logs.

Reviews applied: `['66644']`

Failed command: `Start-MesosCITesting`

All the build artifacts available at: http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/66644

Relevant logs:

- [mesos-tests-stdout.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/66644/logs/mesos-tests-stdout.log):

```
[       OK ] OperationStatusUpdateManagerTest.RecoverNotCheckpointedStream (7 ms)
[ RUN      ] OperationStatusUpdateManagerTest.RecoverEmptyFile
[       OK ] OperationStatusUpdateManagerTest.RecoverEmptyFile (14 ms)
[ RUN      ] OperationStatusUpdateManagerTest.RecoverEmptyDirectory
[       OK ] OperationStatusUpdateManagerTest.RecoverEmptyDirectory (14 ms)
[ RUN      ] OperationStatusUpdateManagerTest.RecoverTerminatedStream
[       OK ] OperationStatusUpdateManagerTest.RecoverTerminatedStream (19 ms)
[ RUN      ] OperationStatusUpdateManagerTest.IgnoreDuplicateUpdate
[       OK ] OperationStatusUpdateManagerTest.IgnoreDuplicateUpdate (20 ms)
[ RUN      ] OperationStatusUpdateManagerTest.IgnoreDuplicateUpdateAfterRecover
[       OK ] OperationStatusUpdateManagerTest.IgnoreDuplicateUpdateAfterRecover (16 ms)
[ RUN      ] OperationStatusUpdateManagerTest.RejectDuplicateAck
[       OK ] OperationStatusUpdateManagerTest.RejectDuplicateAck (15 ms)
[ RUN      ] OperationStatusUpdateManagerTest.RejectDuplicateAckAfterRecover
[       OK ] OperationStatusUpdateManagerTest.RejectDuplicateAckAfterRecover (15 ms)
[ RUN      ] OperationStatusUpdateManagerTest.NonStrictRecoveryCorruptedFile
[       OK ] OperationStatusUpdateManagerTest.NonStrictRecoveryCorruptedFile (21 ms)
[ RUN      ] OperationStatusUpdateManagerTest.StrictRecoveryCorruptedFile
[       OK ] OperationStatusUpdateManagerTest.StrictRecoveryCorruptedFile (20 ms)
[ RUN      ] OperationStatusUpdateManagerTest.UpdateLatestWhenResending
[       OK ] OperationStatusUpdateManagerTest.UpdateLatestWhenResending (20 ms)
[----------] 16 tests from OperationStatusUpdateManagerTest (277 ms total)

[----------] 6 tests from PartitionTest
[ RUN      ] PartitionTest.PartitionedSlave
[       OK ] PartitionTest.PartitionedSlave (286 ms)
[ RUN      ] PartitionTest.PartitionedSlaveExitedExecutor
[       OK ] PartitionTest.PartitionedSlaveExitedExecutor (371 ms)
[ RUN      ] PartitionTest.TaskCompletedOnPartitionedAgent
```

- [mesos-tests-stderr.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/66644/logs/mesos-tests-stderr.log):

```
I0423 20:57:57.007681 18224 master.cpp:8517] Marked agent dc3b2518-4a8c-4d3a-bd8e-a36dfba3d82a-S0
(winbldsrv-01.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net) unreachable: health check
timed out
I0423 20:57:57.007681 18224 master.cpp:10482] Updating the state of task 1 of framework dc3b2518-4a8c-4d3a-bd8e-a36dfba3d82a-0000
(latest state: TASK_LOST, status update state: TASK_LOST)
I0423 20:57:57.009660 26088 hierarchical.cpp:609] Removed agent dc3b2518-4a8c-4d3a-bd8e-a36dfba3d82a-S0
I0423 20:57:57.010648 18224 master.cpp:10581] Removing task 1 with resources cpus(allocated:
*):4; mem(allocated: *):2048; disk(allocated: *):1024; ports(allocated: *):[31000-32000] of
framework dc3b2518-4a8c-4d3a-bd8e-a36dfba3d82a-0000 on agent dc3b2518-4a8c-4d3a-bd8e-a36dfba3d82a-S0
at slave(87)@10.3.1.8:50409 (winbldsrv-01.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net)
I0423 20:57:57.010648 18224 master.cpp:8147] Sending status update TASK_LOST for task 1 of
framework dc3b2518-4a8c-4d3a-bd8e-a36dfba3d82a-0000 'health check timed out'
I0423 20:57:57.011663 18224 master.cpp:10610] Removing executor 'default' with resources []
of framework dc3b2518-4a8c-4d3a-bd8e-a36dfba3d82a-0000 on agent dc3b2518-4a8c-4d3a-bd8e-a36dfba3d82a-S0
at slave(87)@10.3.1.8:50409 (winbldsrv-01.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net)
I0423 20:57:57.013650 18224 master.cpp:2045] Notifying framework dc3b2518-4a8c-4d3a-bd8e-a36dfba3d82a-0000
(default) at scheduler-8199cd61-e811-4eac-a4b4-fcba52ad00ad@10.3.1.8:50409 of lost agent dc3b2518-4a8c-4d3a-bd8e-a36dfba3d82a-S0
(winbldsrv-01.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net)
I0423 20:57:57.014663 30016 slave.cpp:5243] Handling status update TASK_FINISHED (Status UUID:
632972b0-c915-4622-a421-7a0c7e536d4b) for task 1 of framework dc3b2518-4a8c-4d3a-bd8e-a36dfba3d82a-0000
from executor(31)@10.3.1.8:50409
I0423 20:57:57.015681 30016 slave.cpp:1253] Lost leading master
I0423 20:57:57.015681  4088 task_status_update_manager.cpp:181] Pausing sending task status
updates
I0423 20:57:57.016660 30016 slave.cpp:1315] Detecting new master
I0423 20:57:57.017660 30016 slave.cpp:1260] New master detected at master@10.3.1.8:50409
I0423 20:57:57.017660 29096 task_status_update_manager.cpp:181] Pausing sending task status
updates
I0423 20:57:57.017660 30016 slave.cpp:1315] Detecting new master
I0423 20:57:57.019656 30016 slave.cpp:1342] Authenticating with master master@10.3.1.8:50409
I0423 20:57:57.018729  4128 task_status_update_manager.cpp:328] Received task status update
TASK_FINISHED (Status UUID: 632972b0-c915-4622-a421-7a0c7e536d4b) for task 1 of framework
dc3b2518-4a8c-4d3a-bd8e-a36dfba3d82a-0000
I0423 20:57:57.019656 30016 slave.cpp:1351] Using default CRAM-MD5 authenticatee
I0423 20:57:57.019656 26088 authenticatee.cpp:121] Creating new client SASL connection
I0423 20:57:57.019656 30016 slave.cpp:5644] Sending acknowledgement for status update TASK_FINISHED
(Status UUID: 632972b0-c915-4622-a421-7a0c7e536d4b) for task 1 of framework dc3b2518-4a8c-4d3a-bd8e-a36dfba3d82a-0000
to executor(31)@10.3.1.8:50409
I0423 20:57:57.021664 31320 master.cpp:9227] Authenticating slave(87)@10.3.1.8:50409
I0423 20:57:57.021664 32920 authenticator.cpp:98] Creating new server SASL connection
I0423 20:57:57.022657 30164 authenticatee.cpp:213] Received SASL authentication mechanisms:
CRAM-MD5
I0423 20:57:57.023687 30164 authenticatee.cpp:239] Attempting to authenticate with mechanism
'CRAM-MD5'
I0423 20:57:57.023687 30656 authenticator.cpp:204] Received SASL authentication start
I0423 20:57:57.023687 30656 authenticator.cpp:326] Authentication requires more steps
I0423 20:57:57.023687 28796 authenticatee.cpp:259] Received SASL authentication step
I0423 20:57:57.024680 33024 authenticator.cpp:232] Received SASL authentication step
I0423 20:57:57.024680 33024 authenticator.cpp:318] Authentication success
I0423 20:57:57.024680  4088 authenticatee.cpp:299] Authentication success
I0423
```

- Mesos Reviewbot Windows


On April 23, 2018, 6:11 p.m., Megha Sharma wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66644/
> -----------------------------------------------------------
> 
> (Updated April 23, 2018, 6:11 p.m.)
> 
> 
> Review request for mesos and Jiang Yan Xu.
> 
> 
> Bugs: 8750
>     https://issues.apache.org/jira/browse/8750
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> A RunTask message can be dropped for an agent while the agent is
> disconnected from the master. If that agent then becomes unreachable,
> the dropped task is added to the unreachable tasks.
> When the agent re-registers, the master sends status updates for the
> tasks that the agent reported when re-registering, and these tasks are
> also removed from the unreachableTasks on the framework. Because the
> agent doesn't know about the dropped task, that task is never removed
> from the unreachableTasks, which leads to a check failure when
> this inconsistency is detected during framework removal.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp 0d9620dd0c232dc1df83477e838eeb7313bf8828 
>   src/master/master.cpp 767ad8cfe142b47ef07172bcb2a4fb49fc3e833a 
>   src/tests/partition_tests.cpp 9138e5c745cf354a3573e1ab0b251d46702833cc 
> 
> 
> Diff: https://reviews.apache.org/r/66644/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>
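
The bookkeeping problem described in the review request above can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the code from the patch: the `Framework` struct, the task-ID-to-agent-ID map, and the functions `removeReportedTasks` and `removeUnknownTasks` are hypothetical names invented for illustration and do not reflect the actual declarations in `src/master/master.hpp`.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the master's per-framework state.
struct Framework {
  // Task ID -> ID of the agent the task was sent to (assumed layout).
  std::map<std::string, std::string> unreachableTasks;
};

// Models the behaviour described in the review request: on re-registration,
// only the tasks the agent reported are removed from unreachableTasks.
void removeReportedTasks(
    Framework& framework,
    const std::set<std::string>& reported) {
  for (const std::string& taskId : reported) {
    framework.unreachableTasks.erase(taskId);
  }
}

// One possible remedy, assumed here for illustration: also remove tasks that
// are tracked as unreachable for this agent but that the agent did not
// report (for example, because the RunTask message was dropped).
// Returns the IDs of the tasks that were removed.
std::vector<std::string> removeUnknownTasks(
    Framework& framework,
    const std::string& agentId,
    const std::set<std::string>& reported) {
  std::vector<std::string> removed;
  auto it = framework.unreachableTasks.begin();
  while (it != framework.unreachableTasks.end()) {
    if (it->second == agentId && reported.count(it->first) == 0) {
      removed.push_back(it->first);
      it = framework.unreachableTasks.erase(it);
    } else {
      ++it;
    }
  }
  return removed;
}
```

In this model, if `task-1` and `task-2` are both tracked as unreachable on one agent and the agent reports only `task-1` when it re-registers, `removeReportedTasks` leaves `task-2` behind. That leftover entry corresponds to the inconsistency the description says is detected during framework removal. Calling `removeUnknownTasks` for the same agent clears it while leaving entries for other agents untouched.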

