lucenenet-dev mailing list archives

From Shad Storhaug <s...@shadstorhaug.com>
Subject Debugging Help Requested
Date Thu, 06 Jul 2017 21:32:06 GMT
Vincent,

If you have the time, I'd appreciate your assistance with a fix for a long-standing concurrency
bug. I have been putting together a wrapper console application for the various utilities that
ship with Lucene and discovered that 2 of them are non-functional because of this bug. The
upside is that there is now a reliable way to reproduce it. I suspect the bug is also causing
some of the random test failures that we are seeing on certain FSDirectory implementations.

I have pushed the WIP application to my local repository (https://github.com/NightOwl888/lucenenet/tree/cli/src/tools/lucene-cli).
It only runs on .NET Core and in Visual Studio 2015 Update 3. I don't think it makes sense
to support .NET Framework for this utility since .NET Core will run side-by-side with .NET
Framework anyway.

You can run specific commands directly on the command line or in Visual Studio 2015. There
is a server that needs to be started first, and then a client that connects. The problem seems
to be the server.

Command Line

dotnet lucene-cli.dll lock verify-server 127.0.0.4 10

dotnet lucene-cli.dll lock stress-test 3 127.0.0.4 <THE_PORT> NativeFSLockFactory F:\temp2 50 10

Note the port is dynamically chosen by the server at runtime and displayed on the console.
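
For readers unfamiliar with dynamically chosen ports, the usual technique is to bind to port 0 and let the operating system pick a free one. The following is a minimal sketch in Java (the language of upstream Lucene), not the lucene-cli implementation. The class and method names are hypothetical, and it binds to 127.0.0.1 rather than the address used above.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

public class EphemeralPortDemo {
    // Hypothetical helper: binding to port 0 asks the OS for any free port.
    static ServerSocket openOnFreePort(String host) throws IOException {
        return new ServerSocket(0, 50, InetAddress.getByName(host));
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = openOnFreePort("127.0.0.1")) {
            // The chosen port is only known after binding, so the server
            // has to report it before a client can connect.
            System.out.println("Listening on port " + server.getLocalPort());
        }
    }
}
```

This is why the client command cannot be written out in full ahead of time: the port value has to be copied from the server's console output.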

Visual Studio 2015

In Visual Studio 2015, you can just copy everything after "dotnet lucene-cli.dll" and paste
it into the project properties > Debug > Application Arguments text box. Do note I am
not sure if those options are optimal (or even if they may be causing the issue).

What I Have Found

When the client calls the server, the server hangs at LockVerifyServer.cs line 129 (https://github.com/NightOwl888/lucenenet/blob/cli/src/Lucene.Net/Store/LockVerifyServer.cs#L129).
I tried removing that line, and it gets a bit further and then crashes with this error:

An unhandled exception of type 'System.Exception' occurred in System.Private.CoreLib.ni.dll

Additional information: System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
   at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   at System.IO.Stream.ReadByte()
   at System.IO.BinaryReader.InternalReadOneChar()
   at Lucene.Net.Store.LockVerifyServer.ThreadAnonymousInnerClassHelper.Run() in F:\Projects\lucenenet\src\Lucene.Net\Store\LockVerifyServer.cs:line 135


I suspect that has something to do with removing the wait so the timing is off, but I compared
the thread handling code to some similar tests and it looks the same (including the call to
Wait()), so I haven't worked out why that method call isn't completing in this case.
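
One possibility worth ruling out is that the wait is a start gate that opens only once every expected client has connected. This is an assumption, not something confirmed against LockVerifyServer.cs. If it holds, and if the trailing 10 in the verify-server command is the expected client count, a single stress-test client would leave the gate closed indefinitely. The sketch below illustrates that pattern in Java with hypothetical names. It is not the Lucene.NET code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class StartGateDemo {
    // Hypothetical model: each handler announces its client, then waits
    // until all expected clients have arrived before doing any work.
    // Returns true if every handler got past the gate within the timeout.
    static boolean run(int expectedClients, int actualClients, long timeoutMs)
            throws InterruptedException {
        CountDownLatch allConnected = new CountDownLatch(expectedClients);
        CountDownLatch passedGate = new CountDownLatch(actualClients);
        for (int i = 0; i < actualClients; i++) {
            Thread handler = new Thread(() -> {
                allConnected.countDown(); // this client has "connected"
                try {
                    allConnected.await(); // blocks until the count reaches zero
                    passedGate.countDown();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            handler.setDaemon(true); // let the JVM exit even if a handler stays blocked
            handler.start();
        }
        return passedGate.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("10 expected, 1 connected: " + run(10, 1, 500));
        System.out.println("1 expected, 1 connected: " + run(1, 1, 500));
    }
}
```

In this model the first call times out because nine clients never arrive, while the second passes the gate immediately. Starting the server with an expected count that matches the number of clients actually launched would be a quick way to test whether the same thing is happening here.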

I believe this bug is related to a couple of intermittently failing tests that also seem to
indicate the LockFactory is broken.

https://teamcity.jetbrains.com/viewLog.html?buildId=1101813&tab=buildResultsDiv&buildTypeId=LuceneNet_PortableBuilds_TestOnNet451
https://teamcity.jetbrains.com/viewLog.html?buildId=1084071&tab=buildResultsDiv&buildTypeId=LuceneNet_PortableBuilds_TestOnNet451
https://teamcity.jetbrains.com/viewLog.html?buildId=1071425&tab=buildResultsDiv&buildTypeId=LuceneNet_PortableBuilds_TestOnNet451

Namely, the TestLockFactory.StressTestLocks and TestLockFactory.TestStressLocksNativeFSLockFactory
tests.


FYI, the TestIndexWriter.TestTwoThreadsInterruptDeadlock test also fails intermittently, and
is apparently concurrency related. I don't recall which tests they were, but I discovered
a while back that if you put the [Repeat(20)] attribute on them, they would fail more consistently.
I also noticed that they always fail if MMapDirectory is made the only option provided
by the test framework.

Anyway, I would really appreciate it if you could have a look to see if you can work out what
is going on.


Thanks,
Shad Storhaug (NightOwl888)


