lucenenet-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Van Den Berghe, Vincent" <Vincent.VanDenBer...@bvdinfo.com>
Subject RE: Debugging Help Requested
Date Thu, 20 Jul 2017 11:11:16 GMT
Hello Shad,

I would say that MAX_CLIENTS is the maximum number of threads that are observing the lock
from within the verify-server process.
This would be more precise than just saying "the number of connected clients".



Vincent



From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Thursday, July 20, 2017 12:50 PM
To: Van Den Berghe, Vincent <Vincent.VanDenBerghe@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Debugging Help Requested

Vincent,

Thanks. That did the trick.

I updated and created a single stream as you mentioned previously.

I tried creating documentation based on how I thought it worked. I am pretty sure we need
to change the description for "MAX_CLIENTS". Based on your best understanding of LockVerifyServer
and LockStressTest, is there anything here we need to change? https://github.com/NightOwl888/lucenenet/blob/master/src/tools/lucene-cli/docs/lock/index.md


Thanks,
Shad Storhaug (NightOwl888)

From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com]
Sent: Thursday, July 20, 2017 5:17 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org<mailto:dev@lucenenet.apache.org>
Subject: RE: Debugging Help Requested

You should replace:

int command = intReader.ReadByte();
By:

              int command = @in.ReadByte();


... because you are using the return code of ReadByte() (which is -1)  to indicate EOF. A
BinaryWriter never returns -1, but an EOF exception if there is insufficient data. Only a
stream returns -1, which is why the return value of ReadByte() for a stream is an int, whereas
the one for a BinaryReader is a byte.

Vincent

From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Thursday, July 20, 2017 11:59 AM
To: Van Den Berghe, Vincent <Vincent.VanDenBerghe@bvdinfo.com<mailto:Vincent.VanDenBerghe@bvdinfo.com>>
Cc: dev@lucenenet.apache.org<mailto:dev@lucenenet.apache.org>
Subject: RE: Debugging Help Requested

Vincent,

You are my hero. You should wear a cape and learn to fly.

However, there must be something different about the code on your machine than mine because
although that gets the code running, I am now getting an exception:


F:\temp\cli>dotnet lucene-cli.dll lock verify-server 127.0.0.4 2
Listening on 53559...
All clients started, fire gun...


Unhandled Exception: Unhandled Exception: System.Exception: System.IO.EndOfStreamException:
Unable to read beyond the end of the stream.
   at System.IO.BinaryReader.ReadByte()
   at Lucene.Net.Store.LockVerifyServer.ThreadAnonymousInnerClassHelper.Run() in F:\Projects\lucenenet\src\Lucene.Net\Store\LockVerifyServer.cs:line
134 ---> System.IO.EndOfStreamException: Unable to read beyond the end of the stream.
   at System.IO.BinaryReader.ReadByte()
   at Lucene.Net.Store.LockVerifyServer.ThreadAnonymousInnerClassHelper.Run()
   --- End of inner exception stack trace ---
   at Lucene.Net.Store.LockVerifyServer.ThreadAnonymousInnerClassHelper.Run()
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback
callback, Object state)
System.Exception: System.IO.EndOfStreamException: Unable to read beyond the end of the stream.
   at System.IO.BinaryReader.ReadByte()
   at Lucene.Net.Store.LockVerifyServer.ThreadAnonymousInnerClassHelper.Run() in F:\Projects\lucenenet\src\Lucene.Net\Store\LockVerifyServer.cs:line
134 ---> System.IO.EndOfStreamException: Unable to read beyond the end of the stream.
   at System.IO.BinaryReader.ReadByte()
   at Lucene.Net.Store.LockVerifyServer.ThreadAnonymousInnerClassHelper.Run()
   --- End of inner exception stack trace ---
   at Lucene.Net.Store.LockVerifyServer.ThreadAnonymousInnerClassHelper.Run()
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback
callback, Object state)


I have pushed what I have here: https://github.com/NightOwl888/lucenenet/blob/master/src/Lucene.Net/Store/LockVerifyServer.cs.
Could you compare with what you have and let me know what else needs to change?

Thanks,
Shad Storhaug (NightOwl888)


From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com]
Sent: Thursday, July 20, 2017 3:19 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org<mailto:dev@lucenenet.apache.org>
Subject: RE: Debugging Help Requested

Hello Shad,

Bow before me, for I am the master of protocol violations. I assure you it's true: just ask
local law enforcement!

This has been exceedingly frustrating to debug. The reason for the bug is very simple: in
VerifyLockFactory, a SINGLE BYTE is written and read for verification:

                outerInstance.@out.WriteByte(message)<mailto:outerInstance.@out.WriteByte(message)>;
                outerInstance.@out.Flush()<mailto:outerInstance.@out.Flush()>;
                int ret = outerInstance.@in.ReadByte()<mailto:outerInstance.@in.ReadByte()>;

... but in LockVerifyServer, a CHARACTER is read and a 4 BYTE INTEGER is written:

                            int command = intReader.Read();
                                  ...

                                intWriter.Write(command);
                                os.Flush();

(because the writer is created using default parameters, a character is always 1 byte because
the value is 0 or 1 and not a lead-in value for a multibyte character).

Since sockets write their bytes in network order, writing a 01  as an integer writes 01 00
00 00.
So this happens:

1.       Server acquires lock and writes 01

2.       Client reads 01  and writes 01 00 00 00

3.       Server reads the first byte, which is 01: protocol matches, and 3 bytes are left
unread: 00 00 00

4.       Server releases lock and writes 00

5.       Client reads 00 and writes 00 00 00 00. There are now 7 bytes waiting to be read:
00 00 00 00 00 00 00

6.       Server reads 00 (the first unread byte in step 3): protocol matches. 6 bytes are
left unread 00 00 00 00 00 00

7.       Server acquires lock and write 01

8.       Client reads 01 and writes 01 00 00 00. There are now 6 + 4 == 10  bytes waiting
to be read 00 00 00 00 00 00 01 00 00 00

9.       Server reads 00 (the second unread byte in step 3): PROTOCOL VIOLATION

This is a confusing sequence of events, because steps 1 through 8 give the impression of being
correct, whereas everything is wrong from step 2 onwards and we only see it 7 steps later.
The fact that the steps were distributed over 2 processes didn't exactly help. At least in
the beginning.

The solution is to change the following in LockVerifyServer.cs:
Replace:

                            int command = intReader.Read();

By:

                           int command = stream.ReadByte();

Note that although intReader.Read() "works", it is not correct: we don't want to read characters,
but a single byte!
And replace:

                                intWriter.Write(command);

By:

                                os.WriteByte((byte)command);

This doesn't mean there isn't something weird about FSNativeLockFactory. I'd make the changes
outlined in my previous e-mail, i.e. change the concurrent dictionary definition with a static
variant of lazily created values:

        internal static readonly ConcurrentDictionary<string, Lazy<NativeFSLock>>
_locks = new ConcurrentDictionary<string, Lazy<NativeFSLock>>();

... changing the Add code to:

       var path = new DirectoryInfo(System.IO.Path.Combine(m_lockDir.FullName, lockName));
       return _locks.GetOrAdd(path.FullName, s => new Lazy<NativeFSLock>(() =>
new NativeFSLock(this, m_lockDir, s))).Value;


And the remove code to:

              Lazy<NativeFSLock> _;
              outerInstance._locks.TryRemove(path.FullName, out _);


Vincent



From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Saturday, July 15, 2017 3:28 PM
To: Van Den Berghe, Vincent <Vincent.VanDenBerghe@bvdinfo.com<mailto:Vincent.VanDenBerghe@bvdinfo.com>>
Cc: dev@lucenenet.apache.org<mailto:dev@lucenenet.apache.org>
Subject: RE: Debugging Help Requested

> I'm still at loss to explain the behavior of LockVerifyServer and LockStressTest, but
here is what I found so far.

Welcome to my world. You are looking at code that was contributed a few years ago. And since
it has no test coverage, this is probably the first time anyone has ever attempted to run
it. I made a few changes to the original translation of the socket code and I think it is
now working but even that is uncertain. If you drop back to the beta-00004 tag you can see
what it looked like when I started:

https://github.com/apache/lucenenet/blob/Lucene.Net_4_8_0_beta00004/src/Lucene.Net/Store/LockVerifyServer.cs
https://github.com/apache/lucenenet/blob/Lucene.Net_4_8_0_beta00004/src/Lucene.Net/Store/LockStressTest.cs

I have no objections if you want to port these from scratch or rework to be more .NET friendly.

> When I tried to reproduce your problem, I got stuck in s.Accept() of LockVerifyServer.cs.
The fact that you don't see this is baffling: s.Accept is designed to hang waiting for a connection,
which is a weird way to create listening threads on sockets.
> Perhaps the behavior is different in .NET Core, but in .NET 4.5.2 s.Accept() wil hang
waiting for a connection and since it's in a loop the program will hang until there are maxClients
clients connected.

As I mentioned, I wasn't getting anywhere until I commented out the startingGun.Wait() line
(the one someone labeled "not sure about this"). That was probably the wrong thing to do,
but it did end up getting me further in the process instead of stranded at no response.

Apparently, I misinterpreted the fact that maxClients is actually minClients (I wonder why).
I set the number to 10, but only connected 1 client so it didn't kick off the process.

> When one examines NativeFSLockFactory, several things pop up that make you say "hmmm".
Sadly, none of them correct the behavior.
> First, you have this:

            return _locks.GetOrAdd(lockName, (s) => new NativeFSLock(this, m_lockDir, s));

> This isn't exactly wrong, but it does create unnecessary garbage: there is no guarantee
that the valueFactory (the delegate creating your new NativeFSLock) is called only once. It
can be called an arbitrary number of times. The only thing that GetOrAdd guarantees is that
the value will be inserted in the dictionary once with no errors.
> If you must use a ConcurrentDictionary, the standard solution is to change the definition
of the value type to use Lazy<TValue>, like this:

        // LUCENENET: NativeFSLocks in Java are infact singletons; this is how we mimick that
to track instances and make sure
        // IW.Unlock and IW.IsLocked works correctly
        internal readonly ConcurrentDictionary<string, Lazy<NativeFSLock>> _locks
= new ConcurrentDictionary<string, Lazy<NativeFSLock>>();

In Lucene, this _locks variable didn't exist. I tried going through the javadocs to see if
I could find any validity to the claim that NativeFSLocks are singletons in Java, but I found
no such evidence. I haven't gone through the source code to find out, though - maybe if one
digs deep enough the comment can be confirmed. If you look at the way it was in Lucene, they
simply returned a new NativeFSLock every time. Perhaps it would make more sense to make that
class into a singleton?

https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/store/NativeFSLockFactory.java#L105

> Something totally unrelated: not wanting to be pushy or anything, but what I'd really
like to do is to start experimenting with Lucene.Net.Replication. Any chance of that appearing
in the main soon?

That is a question for sir Jens Melgaard, the awesome contributor who is making that happen.
I took a look at his branch and it doesn't seem he has pushed his latest work to GitHub yet,
so the last 5 tests passing is not yet available. https://issues.apache.org/jira/projects/LUCENENET/issues/LUCENENET-565.
Hopefully he submits a pull request soon, but you are welcome to ask about his progress.

But now that you mention it I am a bit stuck because I haven't worked out how to make the
new lucene-cli tool compile on the command line. Actually, it is the restore command that
is the problem. Microsoft is absolutely of no help, and is unwilling to help anyone who is
on the project.json format (which implicitly means that we are stuck on preview 2 of the CLI
tool instead of the production one). I reported an issue a few days ago (https://github.com/NuGet/Home/issues/5593)
that is blocking us from moving to the new .csproj format (thus the latest CLI tool) and they
have so far not provided any answers.

I suppose that could be worked around by unzipping the .nupkg file, correcting the version
information in the .nuspec file, and then re-zipping them in the build script. Or, we could
check in the .nuspec files that would normally auto-generate with the same dependencies as
.csproj into the repo and use the old NuGet pack tool (which may not even work because of
some new elements they added). But these are both hacky workarounds that should be unnecessary.
I don't understand why Microsoft has not fixed this obvious problem in what is supposed to
be production tooling (and seriously, I found a LOT of problems trying to work with the versioning,
but this is the only one I haven't found a workaround for).

The bottom line is we probably need to move on to the new .csproj format sooner rather than
later, and that might throw a wrench into what Jens is working on if his work is not merged
first. On the other hand, it might help him work past impossible issues he is now experiencing
because of the pre-release tooling as well.

UP FOR GRABS

But there are lots of other ways to help if you are looking for ideas: https://github.com/apache/lucenenet/blob/master/CONTRIBUTING.md#porting-work---up-for-grabs.

Two things that stand out (other than what I have already mentioned) that I don't have much
hope for solving myself are:


1.       The "native" C++ Directory implementations in Lucene.Net.Misc for Windows and Posix/Unix.

2.       An issue with interaction between Lucene.Net.ICU and icu-dotnet that is causing the
TestThaiAnalyzer.TestRandomStrings() and TestThaiAnalyzer.TestRandomHugeStrings() (and some
of the Highlighter tests)  to fail with an AccessViolationException (a fatal error in .NET
Core) because of "protected memory".

I took a short C++ course back in 1998, and have not touched it since. I am not sure exactly
how these Directories should be compiled. I might be able to work out how to port the Java
wrapper classes for these in order to call the C++ library(s) once it is compiled and call
them. However, it sounds like "native" Directories might be better than the other ones we
have...?

The second issue I have looked into and I suspect it is concurrency related. The icu-dotnet
library throws an AccessViolationException frequently during these tests. This causes total
meltdown of the test runner in .NET Core because this exception can no longer be caught. So
these tests are currently manually hard-coded to fail in that environment, otherwise the test
runner would fatally crash and never finish all of the tests. I tried to create a valid test
case to send to the icu-dotnet team by collecting all of the data that was passed to icu-dotnet
up to the point of failure, then replaying that data in a separate unit test. The problem
is, the unit test didn't fail even when it was run 2500 times in a row with the same set of
random data that made it fail in Lucene.Net. So, since the same data in the same order isn't
causing it to fail, I suspect it has something to do with 2 threads providing conflicting
data in the test that tries to use the same memory location in icu-dotnet. If we can prove
the problem is on their side (or ours) with a test, then we are on the way to a solution.

FYI - Note that we are not using the production icu-dotnet library because they do not yet
have .NET Core support. It should work with that library in .NET Framework (with the dependencies),
but unfortunately the temporary icu-dotnet package uses the same package ID with a higher
version number than the production one, so there is no way for the build to support both the
production and temporary ones simultaneously (or at least I haven't figured it out yet).


From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com]
Sent: Saturday, July 15, 2017 4:33 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org<mailto:dev@lucenenet.apache.org>
Subject: RE: Debugging Help Requested

Hello Shad,

I'm still at loss to explain the behavior of LockVerifyServer and LockStressTest, but here
is what I found so far.

I made a lucene-cli.csproj file and a lucene-cli.project.js to make it compile correctly.
I had to make the latter file, since project.js confused the build process. I don't know if
both *project.js files can exist next to each other.

When I tried to reproduce your problem, I got stuck in s.Accept() of LockVerifyServer.cs.
The fact that you don't see this is baffling: s.Accept is designed to hang waiting for a connection,
which is a weird way to create listening threads on sockets.
Perhaps the behavior is different in .NET Core, but in .NET 4.5.2 s.Accept() wil hang waiting
for a connection and since it's in a loop the program will hang until there are maxClients
clients connected.
So I changed the creation loop as follows:

                for (int count = 0; count < maxClients; count++)
                {
                    threads[count] = new ThreadAnonymousInnerClassHelper(localLock, lockedID,
startingGun, s);
                    threads[count].Start();
                }

On the Thread.Run method, I needed to call cs.Accept(), and it doesn't hang anymore.
A minor observation: it is completely unnecessary to create separate streams on the same sockets
for reading and writing:  that's a Java-ism. You can just create one stream and a binary reader
and writer on the same stream. Also, the socket can be closed outside it all (even though
.NET guidelines tell us that Dispose should never throw exceptions, and we could replace the
IOUtils.Dispose by a using() statement on the socket):

Hence the code becomes:

                     public override void Run()
                     {
                           var s = cs.Accept();
                           try
                           {
                                  using (Stream stream = new NetworkStream(s))
                                  {
                                         BinaryReader intReader = new BinaryReader(stream);
                                         BinaryWriter intWriter = new BinaryWriter(stream);
                                         try
                                         {
                                                       ... same as original code ...
                                         }
                                         catch (IOException ioe)
                                         {
                                                throw new Exception(ioe.ToString(), ioe);
                                         }
                                         catch (Exception e)
                                         {
                                                // LUCENENET NOTE: We need to throw a new
exception
                                                // to ensure this is Exception and not some
other type.
                                                throw new Exception(e.ToString(), e);
                                         }
                                  }
                            }
                           finally
                           {
                                  IOUtils.DisposeWhileHandlingException(s);
                           }
                     }

But of course, the change of stream code is optional: it doesn't make any difference in behavior.

Now, when I run this according to your guidelines, I'm seeing a crash on the client first:

Connecting to server 127.0.0.1 and registering as client 3...
0% done.
An error occurred:
System.IO.IOException: Protocol violation.
   at Lucene.Net.Store.VerifyingLockFactory.CheckedLock.Verify(Byte message) in D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Store\VerifyingLockFactory.cs:line
64
   at Lucene.Net.Store.VerifyingLockFactory.CheckedLock.Obtain() in D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Store\VerifyingLockFactory.cs:line
75
   at Lucene.Net.Store.Lock.Obtain(Int64 lockWaitTimeout) in D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Store\Lock.cs:line
115
   at Lucene.Net.Store.LockStressTest.Main(String[] args) in D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Store\LockStressTest.cs:line
152
   at Lucene.Net.Cli.LockStressTestCommand.Configuration.<>c.<.ctor>b__0_0(String[]
args) in D:\SOURCE\GitHub\lucenenet\src\tools\lucene-cli\commands\lock\lock-stress-test\LockStressTestCommand.cs:line
28
   at Lucene.Net.Cli.LockStressTestCommand.Run(ConfigurationBase cmd) in D:\SOURCE\GitHub\lucenenet\src\tools\lucene-cli\commands\lock\lock-stress-test\LockStressTestCommand.cs:line
53
   at Lucene.Net.Cli.LockStressTestCommand.Configuration.<.ctor>b__0_1() in D:\SOURCE\GitHub\lucenenet\src\tools\lucene-cli\commands\lock\lock-stress-test\LockStressTestCommand.cs:line
42
   at Lucene.Net.Cli.ConfigurationBase.<>c__DisplayClass3_0.<OnExecute>b__0()
in D:\SOURCE\GitHub\lucenenet\src\tools\lucene-cli\ConfigurationBase.cs:line 62
   at Lucene.Net.Cli.CommandLine.CommandLineApplication.Execute(String[] args) in D:\SOURCE\GitHub\lucenenet\src\tools\lucene-cli\CommandLine\CommandLineApplication.cs:line
304
   at Lucene.Net.Cli.CommandLineOptions.Parse(String[] args) in D:\SOURCE\GitHub\lucenenet\src\tools\lucene-cli\CommandLineOptions.cs:line
32 Specify --help for a list of available options and commands.

... this in turn triggers the crash on the server (which you are seeing) because an error
in the client simply makes it stop.

When one examines NativeFSLockFactory, several things pop up that make you say "hmmm". Sadly,
none of them correct the behavior.
First, you have this:

            return _locks.GetOrAdd(lockName, (s) => new NativeFSLock(this, m_lockDir, s));

This isn't exactly wrong, but it does create unnecessary garbage: there is no guarantee that
the valueFactory (the delegate creating your new NativeFSLock) is called only once. It can
be called an arbitrary number of times. The only thing that GetOrAdd guarantees is that the
value will be inserted in the dictionary once with no errors.
If you must use a ConcurrentDictionary, the standard solution is to change the definition
of the value type to use Lazy<TValue>, like this:

        // LUCENENET: NativeFSLocks in Java are infact singletons; this is how we mimick that
to track instances and make sure
        // IW.Unlock and IW.IsLocked works correctly
        internal readonly ConcurrentDictionary<string, Lazy<NativeFSLock>> _locks
= new ConcurrentDictionary<string, Lazy<NativeFSLock>>();

But (as they say in the commercials) wait, there's more. The remove operation looks like this:

                            NativeFSLock _;
                            outerInstance._locks.TryRemove(path.FullName, out _);

This will never work, because the key used for adding is just lockName, whereas the key used
for removal is path.FullName. The dictionary will only grow, never shrink!
Given this, I changed the add code to:

       var path = new DirectoryInfo(System.IO.Path.Combine(m_lockDir.FullName, lockName));
       return _locks.GetOrAdd(path.FullName, s => new Lazy<NativeFSLock>(() =>
new NativeFSLock(this, m_lockDir, s))).Value;


And the remove code to:

              Lazy<NativeFSLock> _;
              outerInstance._locks.TryRemove(path.FullName, out _);

Since the keys of the dictionary are the actual lock files, we could consider making _locks
a static: a single instance among all NativeFSLocks.
This would also get rid of the outerInstance member in the lock, replacing it with the outer
class name.

        internal static readonly ConcurrentDictionary<string, Lazy<NativeFSLock>>
_locks = new ConcurrentDictionary<string, Lazy<NativeFSLock>>();


Another thing I notice is the code used to delete the lock: I think the directory deletion
code is unnecessary: we're creating and deleting file locks, not directory locks.

But the protocols violations I'm seeing are really strange. I'll try to debug this another
way: to be continued!

Something totally unrelated: not wanting to be pushy or anything, but what I'd really like
to do is to start experimenting with Lucene.Net.Replication. Any chance of that appearing
in the main soon?


Vincent
From: Shad Storhaug
Sent: Friday, July 14, 2017 2:33 AM
To: 'Van Den Berghe, Vincent'
Cc: dev@lucenenet.apache.org<mailto:dev@lucenenet.apache.org>
Subject: RE: Debugging Help Requested

Vincent,

Thanks. This was definitely the most important issue to fix. I have no fear of replacing a
buggy implementation with a functional one - there was no direct way to port the Java implementation
into .NET, so this wasn't another case of it not working because it didn't match Lucene.

I ended up making a few changes:


1.       According to MSDN, FileStream.Flush() could throw an IOException, which would skip
the Dispose() call on the file if it happened in a finally block. Looking at the call to base.Dispose()
in the try block, it actually calls the Flush() method on BufferedIndexOutput. So, it makes
more sense to override Flush() and call file.Flush(true), which ensures Flush() will work
consistently everywhere.

2.       I removed the loop that calls FSync instead of taking the call out of that method.
FSync is not called anywhere else, but this way can still be utilized by subclasses.

3.       Changed from new ReadOnlyHeapByteBuffer(BytesRef.EMPTY_BYTES, 0, 0) to  ByteBuffer.Allocate(0).AsReadOnlyBuffer().
We shouldn't be using the internal members that weren't meant for public use in Java.

You can review here: https://github.com/apache/lucenenet/commit/c0e7a25c4a5901a4b6bea4f92088b5135c1f550a

The TestStressLocks test looks stable. However, on .NET Core when I put [Repeat(20)] on the
TestStressLocksNativeFSLockFactory test I got a concurrent file access exception. I suspect
this has to do with the same LockFactory issue that is causing the CLI tool to fail, not the
MMapDirectory.

I also fixed the IndexInputSlicer: https://github.com/apache/lucenenet/commit/de62e95746f23d6026412e22c6f724fed03b5fa7

I'd appreciate if you could look into the LockFactory issue when you get a chance.


Thanks,
Shad Storhaug (NightOwl888)


From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com]
Sent: Thursday, July 13, 2017 8:31 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org<mailto:dev@lucenenet.apache.org>
Subject: RE: Debugging Help Requested

>> but I found a similar problem when trying to get all of the tests to pass on the
MMapDirectory. When MemoryMappedFile.CreateViewAccessor() is called with more bytes on the
size parameter than the actual file size, it throws UnauthorizedAccessException.

You are correct, of course.  That is indeed the culprit. It also invalidates my "solution"
to catch UnauthorizedAccessException: the real problem is the implementation of MMapDirectory.

I had a look at https://issues.apache.org/jira/projects/LUCENENET/issues/LUCENENET-530 but
I have doubts about the scalability of the approach. The call to _memoryMappedFile.CreateViewStream(0,
0, MemoryMappedFileAccess.Read); needs a single continuous chunk of virtual memory as large
as the size of the mapped file. For a lot of large files , this might be problematic. The
"list of buffers" approach is probably the better approach. I do have a working implementation
based on MemoryMappedFile.CreateViewStream, if you're interested: it does work and corrects
all the problems I'm seeing, but I'm hesitant to just replace the existing implementation,
even if all the tests run.

However, by making minimal modifications, I was able to make all MMapDirectory tests run flawlessly,
including my own.
Here's what I did:
In FSDirectory.cs:

-          Implemented the file.Flush(true); call in the FSDirectory.FSIndexOutput, and commented
out the call to IOUtils.Fsync in FSDirectory.Fsync. This remains vitally important.

In MMapDirectory.cs:

-          In MMapDirectory.OpenInput, change the call from:

            var c = new FileStream(file.FullName, FileMode.Open, FileAccess.ReadWrite, FileShare.ReadWrite);



To:
                     var c = new FileStream(file.FullName, FileMode.Open, FileAccess.Read,
FileShare.ReadWrite);


-          In MMapDirectory.Map, replace

            long fileCapacity = length == 0 ? ushort.MaxValue : length;

            if (input.memoryMappedFile == null)
            {
#if NETSTANDARD
                input.memoryMappedFile = MemoryMappedFile.CreateFromFile(fc, null, fileCapacity,
MemoryMappedFileAccess.ReadWrite, HandleInheritability.Inheritable, false);
#else
                input.memoryMappedFile = MemoryMappedFile.CreateFromFile(fc, null, fileCapacity,
MemoryMappedFileAccess.ReadWrite, null, HandleInheritability.Inheritable, false);
#endif
            }


By:


                     if (length == 0)
                           return new[] { new ReadOnlyHeapByteBuffer(BytesRef.EMPTY_BYTES,
0, 0) };

                     if (input.memoryMappedFile == null)
                     {
#if NETSTANDARD
            input.memoryMappedFile = MemoryMappedFile.CreateFromFile(fc, null, length, MemoryMappedFileAccess.Read,
HandleInheritability.Inheritable, false);
#else
                           input.memoryMappedFile = MemoryMappedFile.CreateFromFile(fc, null,
length, MemoryMappedFileAccess.Read, null, HandleInheritability.Inheritable, false);
#endif
                     }

(you may need to add a using Util; inside the namespace Lucene.Net.Store to access BytesRef).


I've been running hundreds of iterations of the StressTestLocks and TestStressLocksNativeFSLockFactory
and haven't seen any single failure yet.


Vincent


From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Wednesday, July 12, 2017 4:00 PM
To: Van Den Berghe, Vincent <Vincent.VanDenBerghe@bvdinfo.com<mailto:Vincent.VanDenBerghe@bvdinfo.com>>
Cc: dev@lucenenet.apache.org<mailto:dev@lucenenet.apache.org>
Subject: RE: Debugging Help Requested

Vincent,

Thanks for looking into these.

[Finding 1]


>  Why does it throw UnauthorizedAccessException? I have no idea

This may not be related, but I found a similar problem when trying to get all of the tests
to pass on the MMapDirectory. When MemoryMappedFile.CreateViewAccessor() is called with more
bytes on the size parameter than the actual file size, it throws UnauthorizedAccessException.
See: https://stackoverflow.com/a/5501331/181087

Note that if you are on 64 bit Windows, the line FSDirectory.Open(@"E:\Temp\LuceneStressTest");
will always return MMapDirectory.

When recently going through the old JIRA issues, I noticed that someone contributed an MMapDirectory
implementation that was never actually used. Frankly, that "array of ByteBuffers" is probably
not a sound approach in .NET, so it is worth at least reviewing to see if this other implementation
is a better design approach. See: https://issues.apache.org/jira/projects/LUCENENET/issues/LUCENENET-530.
I was able to determine that MMapDirectory (or something it is dependent on) is definitely
causing 2 of the intermittent test failures by checking each Directory implementation individually
in the test framework.

If this is the source of the UnauthorizedAccessException, it makes more sense to fix the source
than to patch it in a catch block (which likely would need to be done in multiple places).



>  I know this isn't a direct answer to your debugging request on your .NET Core thing,
but I got zero experience on that .NET Core thing and thought that continuing an ongoing investigation
would be a more efficient use of my time.

Actually, the issue is not .NET Core-specific. If you feel more comfortable working on .NET
Framework, you could add a `.csproj` file to the lucene-cli tool (which is now on master)
and run the same thing on .NET Framework. But we have no plans to have a .NET Framework version
of this tool, since it won't really make any difference anyway on the command line (other
than the fact that .NET Framework only runs on Windows). You will need to adjust the first
part of the command-line commands to "lucene-cli.exe" (or whatever you put in the assembly
name box of project properties) rather than "dotnet lucene-cli.dll" in that case.

Thanks,
Shad Storhaug (NightOwl888)

From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com]
Sent: Wednesday, July 12, 2017 8:36 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org<mailto:dev@lucenenet.apache.org>
Subject: RE: Debugging Help Requested

Something I forgot to mention wrt to the previous e-mail: the test application to reproduce
should be running in 32-bit ("32-bit preferred").
For some strange reason, the problem is different in 64-bit and is not corrected by it.
I'm starting to think the "access denied" is an antivirus problem... I'll keep you informed.

Vincent



From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com]
Sent: Wednesday, July 12, 2017 7:32 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org<mailto:dev@lucenenet.apache.org>
Subject: RE: Debugging Help Requested

Hello Shad,

Ever since you've mentioned in your e-mail of Tue 6/13/2017 2:54 PM that we still have multithreading
problems, I've been attempting to track them down (and why I couldn't see them). Your previous
e-mail caused me to redouble my efforts.
I got 2 findings to report so far.

[Finding 1]
I've discovered at least one problem when  one thread is appending documents, and another
one is reading from the same directory.
At some point when the IndexWriter is disposed, I've observed one of two possible stack traces
(note that the lines are those of my own build):

                Lucene.Net.dll!Lucene.Net.Util.IOUtils.Fsync(string fileToSync, bool isDir)
Line 468          C#
                Lucene.Net.dll!Lucene.Net.Store.FSDirectory.Fsync(string name) Line 536  
        C#
                Lucene.Net.dll!Lucene.Net.Store.FSDirectory.Sync(System.Collections.Generic.ICollection<string>
names) Line 365                C#
                Lucene.Net.dll!Lucene.Net.Index.SegmentInfos.WriteSegmentsGen(Lucene.Net.Store.Directory
dir, long generation) Line 301       C#
                Lucene.Net.dll!Lucene.Net.Index.SegmentInfos.FinishCommit(Lucene.Net.Store.Directory
dir) Line 1261                C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.FinishCommit() Line 3792     
        C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.CommitInternal() Line 3775   
      C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.CloseInternal(bool waitForMerges,
bool doFlush) Line 1253                C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.Dispose(bool waitForMerges) Line
1093 C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.Dispose() Line 1051          C#

More specifically, in the above stack trace, this statement in IOUtils.Fsync  succeeded (fileToSync
is the "segments.gen" file):

   file = new FileStream(fileToSync,
                        FileMode.Open, // We shouldn't create a file when syncing.
                        // Java version uses FileChannel which doesn't create the file if
it doesn't already exist,
                        // so there should be no reason for attempting to create it in Lucene.Net.
                        FileAccess.Write,
                        FileShare.ReadWrite);

The other possible stack trace replaces the lines above FinishCommit() Line 3792 with

                Lucene.Net.dll!Lucene.Net.Store.FSDirectory.FSIndexOutput.FSIndexOutput(Lucene.Net.Store.FSDirectory
parent, string name) Line 467     C#
                Lucene.Net.dll!Lucene.Net.Store.FSDirectory.CreateOutput(string name, Lucene.Net.Store.IOContext
context) Line 323             C#
                Lucene.Net.dll!Lucene.Net.Index.SegmentInfos.WriteSegmentsGen(Lucene.Net.Store.Directory
dir, long generation) Line 289       C#
                Lucene.Net.dll!Lucene.Net.Index.SegmentInfos.FinishCommit(Lucene.Net.Store.Directory
dir) Line 1270                C#

... with a very similar statement which succeeds:

file = new FileStream(Path.Combine(parent.m_directory.FullName, name), FileMode.OpenOrCreate,
FileAccess.ReadWrite, FileShare.ReadWrite);


At the same time, I had a failing IndexReader with the following stack trace:

>             Lucene.Net.dll!Lucene.Net.Store.SimpleFSDirectory.OpenInput(string name,
Lucene.Net.Store.IOContext context) Line 89                C#
                Lucene.Net.dll!Lucene.Net.Store.Directory.OpenChecksumInput(string name, Lucene.Net.Store.IOContext
context) Line 115             C#
                Lucene.Net.dll!Lucene.Net.Index.SegmentInfos.FindSegmentsFile.Run(Lucene.Net.Index.IndexCommit
commit) Line 908              C#
                Lucene.Net.dll!Lucene.Net.Index.StandardDirectoryReader.Open(Lucene.Net.Store.Directory
directory, Lucene.Net.Index.IndexCommit commit, int termInfosIndexDivisor) Line 55       C#
                Lucene.Net.dll!Lucene.Net.Index.DirectoryReader.Open(Lucene.Net.Store.Directory
directory) Line 69                C#

And this statement in  SimpleFSDirectory.OpenInput fails (path.FullName is the same as fileToSync):

           var raf = new FileStream(path.FullName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);

This fails with the UnauthorizedAccessException. This exception is uncaught,  and propagates
out of the DirectoryReader, causing it to fail.
Why does it throw UnauthorizedAccessException? I have no idea: the 2 FileStream constructors
are compatible since their FileAccess/FileShare mode allows both reading and writing. But
it does fail with that exception.
To make sure it's not an artifact of the test framework, I've written the following small
program that reproduces the problem. It's very similar to the existing test in _testStressLocks,
but doesn't use mock-ups since the overhead seems to be a factor in reproducing it (on my
machine). You should be able to just cut & paste and see for yourself. If course, if you
don't I shall have made a fool of myself again.
Ah well. Here it goes:

using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Search;
using System.Threading;

namespace StressTest
{
       class Program
       {
              public static volatile bool WriterThreadActive;

              private static void WriterThread(Directory directory, int iterations)
              {
                     WriterThreadActive = true;
                     try
                     {
                           for (int i = 0; i < iterations; ++i)
                                  using (var writer = new IndexWriter(directory, new IndexWriterConfig(LuceneVersion.LUCENE_48,
new StandardAnalyzer(LuceneVersion.LUCENE_48)) { OpenMode = OpenMode.APPEND }))
                                  {
                                         Document doc = new Document();
                                         doc.Add(new TextField("content", "aaa", Field.Store.NO));
                                         writer.AddDocument(doc);
                                  }
                     }
                     finally
                     {
                           WriterThreadActive = false;
                     }
              }

              private static void ReaderThread(Directory directory, int iterations)
              {
                     var query = new TermQuery(new Term("content", "aaa"));
                     for (int i = 0; WriterThreadActive || i < iterations; ++i)
                     {
                           using (var reader = DirectoryReader.Open(directory))
                           {
                                  var searcher = new IndexSearcher(reader);
                                  searcher.Search(query, null, 1000);
                           }
                     }
              }

              static void Main(string[] args)
              {
                     var directory = FSDirectory.Open(@"E:\Temp\LuceneStressTest");

                     using (var writer = new IndexWriter(directory, new IndexWriterConfig(LuceneVersion.LUCENE_48,
new StandardAnalyzer(LuceneVersion.LUCENE_48)) { OpenMode = OpenMode.CREATE }))
                     {
                           Document doc = new Document();
                           doc.Add(new TextField("content", "aaa", Field.Store.NO));
                           writer.AddDocument(doc);
                     }

                     const int MinimumIterations = 10000;     // you may need to increase
this

                     var writerThread = new Thread(() => WriterThread(directory, MinimumIterations))
{ Name = "WriterThread" };
                     var readerThread = new Thread(() => ReaderThread(directory, MinimumIterations))
{ Name = "ReaderThread" };
                     writerThread.Start();
                     readerThread.Start();
                     writerThread.Join();
                     readerThread.Join();
                     directory.Dispose();
              }
       }
}

You may tweak the E:\Temp\LuceneStressTest and the MinimumIterations accordingly.

The solution is to add the UnauthorizedAccessException to the catch clause in SegmentInfos.cs:
(SegmentInfos.FindSegmentFile.Run):

                        try
                        {
                            genInput = directory.OpenChecksumInput(IndexFileNames.SEGMENTS_GEN,
IOContext.READ_ONCE);
                        }
                        catch (IOException e)
                        {
                            if (infoStream != null)
                            {
                                Message("segments.gen open: IOException " + e);
                            }
                        }
                     catch (UnauthorizedAccessException e)
                     {
                           if (infoStream != null)
                           {
                                  Message("segments.gen open: UnauthorizedAccessException
" + e);
                           }
                     }

UnauthorizedAccessException doesn't inherit from IOException, you you need to catch both of
them.
After this modification, the problem (obviously) disappears. Maybe similar code would need
similar additions, but there is a lack of evidence to do so.

[Finding 2]
I am on the trail of the intermittent failure of StressTestLocks / TestStressLocksNativeFSLockFactory
but the margin of this e-mail is too small to contain it. More later.


Hopefully, this helps a bit.
I know this isn't a direct answer to your debugging request on your .NET Core thing, but I
got zero experience on that .NET Core thing and thought that continuing an ongoing investigation
would be a more efficient use of my time.


Vincent



From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Thursday, July 06, 2017 11:32 PM
To: Van Den Berghe, Vincent <Vincent.VanDenBerghe@bvdinfo.com<mailto:Vincent.VanDenBerghe@bvdinfo.com>>
Cc: dev@lucenenet.apache.org<mailto:dev@lucenenet.apache.org>
Subject: Debugging Help Requested

Vincent,

If you have the time, I'd appreciate your assistance with a fix for a long-standing concurrency
bug. I have been putting together wrapper console application for the various utilities that
ship with Lucene and discovered that 2 of them are non-functional because of this bug, but
on the upside is now there is a reliable way to reproduce it. I suspect the bug is also causing
some of the random test failures that we are seeing on certain FSDirectory implementations.

I have pushed the WIP application to my local repository (https://github.com/NightOwl888/lucenenet/tree/cli/src/tools/lucene-cli).
It only runs on .NET Core and in Visual Studio 2015 Update 3. I don't think it makes sense
to support .NET framework for this utility since .NET Core will run side-by-side with .NET
Framework anyway.

You can run a specific commands directly on the command line or in Visual Studio 2015. There
is a server that needs to be started first, and then a client that connects. The problem seems
to be the server.

Command Line

dotnet lucene-cli.dll lock verify-server 127.0.0.4 10

dotnet lucene-cli.dll lock stress-test 3 127.0.0.4 <THE_PORT> NativeFSLockFactory F:\temp2
50 10

Note the port is dynamically chosen by the server at runtime and displayed on the console.

Visual Studio 2015

In Visual Studio 2015, you can just copy everything after "dotnet lucene-cli.dll" and paste
it into the project properties > Debug > Application Arguments text box. Do note I am
not sure if those options are optimal (or even if they may be causing the issue).

What I Have Found

When the client calls the server, the server locks on LockVerifyServer.cs line 129 (https://github.com/NightOwl888/lucenenet/blob/cli/src/Lucene.Net/Store/LockVerifyServer.cs#L129).
I tried removing that line, and it gets a bit further and then crashes with this error:

An unhandled exception of type 'System.Exception' occurred in System.Private.CoreLib.ni.dll

Additional information: System.IO.IOException: Unable to read data from the transport connection:
An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException:
An existing connection was forcibly closed by the remote host

   at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags
socketFlags)

   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)

   --- End of inner exception stack trace ---

   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)

   at System.IO.Stream.ReadByte()

   at System.IO.BinaryReader.InternalReadOneChar()

   at Lucene.Net.Store.LockVerifyServer.ThreadAnonymousInnerClassHelper.Run() in F:\Projects\lucenenet\src\Lucene.Net\Store\LockVerifyServer.cs:line
135


I suspect that has something to do with removing the wait so the timing is off, but I compared
the thread handling code to some similar tests and it looks the same (including the call to
Wait()), so I haven't worked out why that method call isn't completing in this case.

I believe this bug is related to a couple of intermittently failing tests that also seem to
indicate the LockFactory is broken.

https://teamcity.jetbrains.com/viewLog.html?buildId=1101813&tab=buildResultsDiv&buildTypeId=LuceneNet_PortableBuilds_TestOnNet451
https://teamcity.jetbrains.com/viewLog.html?buildId=1084071&tab=buildResultsDiv&buildTypeId=LuceneNet_PortableBuilds_TestOnNet451
https://teamcity.jetbrains.com/viewLog.html?buildId=1071425&tab=buildResultsDiv&buildTypeId=LuceneNet_PortableBuilds_TestOnNet451

Namely, the TestLockFactory.StressTestLocks and TestLockFactory.TestStressLocksNativeFSLockFactory
tests.


FYI, the TestIndexWriter.TestTwoThreadsInterruptDeadlock test also fails intermittently, and
is apparently concurrency related. I don't recall which tests they were, but I discovered
a while back that if you put the [Repeat(20)] attribute on them, they would fail more consistently.
I also noticed that they always fail if MMapDirectory is made as the only option provided
by the test framework.

Anyway, I would really appreciate if you could have a look to see if you can work out what
is going on.


Thanks,
Shad Storhaug (NightOwl888)



Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message