lucenenet-dev mailing list archives

From Shad Storhaug <s...@shadstorhaug.com>
Subject RE: Benchmark Concurrency Bug
Date Tue, 01 Aug 2017 05:46:02 GMT
Vincent,

I am curious to know what you meant by "something more lightweight". 

After taking a deeper dive into HTML Agility Pack, I found that it is DOM based (it can load
the DOM from a stream or TextReader, though). Although one might argue that HTML documents are
never going to be very large anyway, it feels inherently wrong to me to put together a solution
that you know in advance isn't going to scale. In this instance one could even argue that
we don't have a true apples-to-apples performance comparison with Lucene because the loading
of the document takes place before the parsing begins (and parsing is the only part that is concurrent).
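
To make the DOM versus stream distinction concrete, here is a minimal sketch that uses the built-in System.Xml types rather than HTML Agility Pack or TagSoup. It assumes well-formed XML input instead of real-world HTML, and StreamVsDom and its methods are hypothetical names chosen for illustration only.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical illustration; not part of Lucene.Net or the Benchmark code.
public static class StreamVsDom
{
    // DOM style: the whole document is loaded into memory
    // before any element can be visited.
    public static int CountElementsDom(TextReader input)
    {
        var doc = new XmlDocument();
        doc.Load(input);
        return doc.GetElementsByTagName("*").Count;
    }

    // Streaming style: elements are visited one at a time as the
    // reader advances, so the full tree is never held in memory.
    public static int CountElementsStreaming(TextReader input)
    {
        int count = 0;
        using (var reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Both methods count the same elements; the difference is that the DOM version pays for the complete load up front, which is the part that would sit outside the concurrent parsing step.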

Do you know of a stream-based HTML parsing alternative to TagSoup?

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Monday, July 31, 2017 9:36 PM
To: Van Den Berghe, Vincent
Cc: dev@lucenenet.apache.org
Subject: RE: Benchmark Concurrency Bug

Thanks!

I knew it had to be something simple. It looks like that error is happening because of a race
condition. Oh well, it probably isn't worth the effort considering the intended audience of
the tool.

I hope you are feeling better soon.

From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com]
Sent: Monday, July 31, 2017 8:07 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Benchmark Concurrency Bug

Hello Shad,

There are 2 causes for the tests TestOneDocument/TestTwoDocuments never to terminate:

Cause 1: the Parser.Run() method is never called. In the Java code, this type implements Runnable,
but here it doesn't. The thread is supposed to be started at the first call to Parser.Next(),
but because it is created without a reference to the parser, it does absolutely nothing:

                if (t == null)
                {
                    threadDone = false;
                    t = new ThreadClass(/*this*/);
                    t.SetDaemon(true);
                    t.Start();
                }

The minimal solution is to define a new class:

                private class MyThreadClass : ThreadClass
                {
                    private readonly Action m_Run;

                    public MyThreadClass(Action run)
                    {
                        m_Run = run;
                    }

                    public override void Run()
                    {
                        m_Run();
                    }
                }


And change the above code to:

                if (t == null)
                {
                    threadDone = false;
                    t = new MyThreadClass(Run);
                    t.SetDaemon(true);
                    t.Start();
                }
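
For comparison, the same wiring can be sketched with the plain System.Threading.Thread type, which takes the delegate at construction time and therefore cannot be started without a body. This is only an illustration under assumptions: BackgroundParser and its members are hypothetical names rather than the actual Benchmark class, and IsBackground = true is assumed to play the role of SetDaemon(true).

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the parser; not the actual Benchmark code.
public class BackgroundParser
{
    private Thread t;
    private volatile bool threadDone;
    private int runCount;

    // Number of times the worker body has executed.
    public int RunCount
    {
        get { return runCount; }
    }

    // Body that the worker thread is expected to execute.
    private void Run()
    {
        Interlocked.Increment(ref runCount);
        threadDone = true;
    }

    // Starts the worker on first use; later calls do nothing.
    public void EnsureStarted()
    {
        if (t == null)
        {
            threadDone = false;
            t = new Thread(Run);   // the delegate is what makes the thread do work
            t.IsBackground = true; // assumed equivalent of SetDaemon(true)
            t.Start();
        }
    }

    // Waits up to the given time for the worker to finish its body.
    public bool WaitForCompletion(int milliseconds)
    {
        return t != null && t.Join(milliseconds) && threadDone;
    }
}
```

The point of the sketch is the constructor argument: without a delegate (or an overridden Run), starting the thread succeeds but no parsing ever happens, which matches the tests hanging forever.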

This lets the thread make progress, but the tests still fail. Cause 2: the code that creates
the XmlReader:

                    // Java: XMLReaderFactory.createXMLReader();
                    Sax.Net.IXmlReader reader = XmlReaderFactory.Current.CreateXmlReader();


... fails because XmlReaderFactory.Current expects the reader type to be loaded from configuration
files. Alas, something happens on its way to the forum and you get a "null reference exception"
preceded by a "thread abort exception", causing the tests to fail because the reader is never
created.

I had half a mind to replace the Sax parser (which is an idiom that is not implemented in
.NET) by something more lightweight, but since I'm feeling a bit under the weather, I just
changed the line to:

Sax.Net.IXmlReader reader = new TagSoup.Net.XmlReaderFactory().CreateXmlReader();

...and be done with it. And on my machine, the tests pass now. I hope they do too on your
special machine <g>


The test TestForever() works as well, but ends with an exception (which is swallowed):

System.ObjectDisposedException: Cannot access a closed Stream.
   at System.IO.__Error.StreamIsClosed()
   at System.IO.MemoryStream.Read(Byte[] buffer, Int32 offset, Int32 count)
   at System.IO.StreamReader.ReadBuffer()
   at System.IO.StreamReader.Read()
   at TagSoup.Net.HTMLScanner.Scan(TextReader r, IScanHandler h)

The reason is that the parse call:

      reader.Parse(new InputSource(IOUtils.GetDecodingReader(localFileIS, Encoding.UTF8)));


... seems to still want the StreamReader (and, with it, the underlying memory stream) after
source.Dispose() is called. Since the test passes, I'll pretend the problem doesn't exist.
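
The exception itself is easy to reproduce in isolation. The following is a minimal sketch using only base class library types; ClosedStreamDemo is a hypothetical name, and the single-threaded ordering here is a simplification of whatever ordering actually occurs between the parser thread and the Dispose() call in the test.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical illustration; not part of the Benchmark code.
public static class ClosedStreamDemo
{
    // Returns true if reading through the StreamReader fails because
    // the underlying MemoryStream was disposed before the first read.
    public static bool ReadAfterDispose()
    {
        var stream = new MemoryStream(Encoding.UTF8.GetBytes("<html></html>"));
        var reader = new StreamReader(stream, Encoding.UTF8);

        // Dispose the source while the reader still refers to it.
        stream.Dispose();

        try
        {
            // Nothing is buffered yet, so this call must go to the closed stream.
            reader.Read();
            return false;
        }
        catch (ObjectDisposedException)
        {
            return true;
        }
    }
}
```

If the reader had already buffered the whole input before the stream was disposed, the read could succeed, which may be why the failure only shows up at the end of the test.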

Vincent


From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Monday, July 31, 2017 10:34 AM
To: Van Den Berghe, Vincent <Vincent.VanDenBerghe@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: Benchmark Concurrency Bug

Vincent,

I have pushed Benchmark to my branch here: https://github.com/NightOwl888/lucenenet/tree/benchmark.
There are 106/109 tests passing, but there are 3 tests here that never finish: https://github.com/NightOwl888/lucenenet/blob/benchmark/src/Lucene.Net.Tests.Benchmark/ByTask/Feeds/EnwikiContentSourceTest.cs#L29

There is also still one unfinished matter in that TagSoup/Sax.Net doesn't support .NET Standard.
It is a close match for Java's SAX parser, but so far the owner of the project has not replied
to my query about whether he would be open to a PR. So, I have my eye on using the HTML Agility
Pack instead: https://www.nuget.org/packages/HtmlAgilityPack. If the concurrency bug happens
to have something to do with Sax.Net, feel free to replace it with the HTML Agility Pack.

I would appreciate it if you could have a look at this when you have a chance.

Thanks,
Shad Storhaug (NightOwl888)
