groovy-users mailing list archives

From Guillaume Laforge <glafo...@gmail.com>
Subject Re: UTF16 BOM in new PrintWriter() vs withPrintWriter()
Date Tue, 09 Jun 2015 15:33:53 GMT
So now, how to decide what's best? :-)

Is a Java reader happy with the BOM, and does it consume it transparently?
(I think in the past that wasn't the case, but I may be wrong.)
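
One quick way to check is the small standalone sketch below. It is not taken from the Groovy codebase, and the class and method names are made up for illustration. It decodes the same four bytes from the test further down (ff fe 20 00) with two of the JDK charsets. As far as I read the java.nio.charset.Charset Javadoc, the generic "UTF-16" decoder uses the BOM to pick the byte order and drops it, while "UTF-16LE" hands it back as a U+FEFF character. StandardCharsets needs Java 7.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomReadCheck {
    // Hypothetical helper: decode raw bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Little-endian BOM (ff fe) followed by a single space (20 00).
        byte[] withBom = {(byte) 0xFF, (byte) 0xFE, 0x20, 0x00};

        // UTF-16LE: the BOM is not consumed; it comes back as U+FEFF.
        String le = decode(withBom, StandardCharsets.UTF_16LE);
        System.out.println("UTF-16LE length: " + le.length());

        // UTF-16: the BOM selects the byte order and is not part of the text.
        String generic = decode(withBom, StandardCharsets.UTF_16);
        System.out.println("UTF-16 length:   " + generic.length());
    }
}
```

So whether a plain Java reader "eats" the BOM seems to depend on which charset name it was opened with, not on the reader class itself.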

2015-06-09 17:21 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:

> That's an excellent point, Paolo.  NioGroovyMethods.newWriter claims (in
> the JavaDoc) it will write the BOM if needed, but it doesn't because it
> uses Java's implementation rather than Groovy's
> writeUTF16BomIfRequired.  None of the methods in NioGroovyMethods use
> writeUTF16BomIfRequired.
>
> Whichever we decide, we should be consistent.
>
> -Keegan
>
> On Tue, Jun 9, 2015 at 11:08 AM, Paolo Di Tommaso <
> paolo.ditommaso@gmail.com> wrote:
>
>> I'm wondering if NioGroovyMethods that implement the write methods for
>> Path should do the same.
>>
>>
>> Cheers,
>> Paolo
>>
>>
>> On Tue, Jun 9, 2015 at 4:02 PM, Keegan Witt <keeganwitt@gmail.com> wrote:
>>
>>> Cool.  I'll wait for PR 36 to be merged first, because I also was
>>> thinking the Javadoc would be changed from
>>>     is "UTF-16BE" or "UTF-16LE"
>>> to
>>>     is "UTF-16BE" or "UTF-16LE" (or an equivalent alias)
>>>
>>> -Keegan
>>>
>>>
>>> On Tue, Jun 9, 2015 at 9:08 AM, Guillaume Laforge <glaforge@gmail.com>
>>> wrote:
>>>
>>>>
>>>> 2015-06-09 15:04 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>
>>>>> Created GROOVY-7461
>>>>> <https://issues.apache.org/jira/browse/GROOVY-7461> and PR 36
>>>>> <https://github.com/apache/incubator-groovy/pull/36>.
>>>>>
>>>>
>>>> Cool!
>>>>
>>>>
>>>>> How would you feel about a PR to copy the Javadoc comment mentioning
>>>>> the UTF-16 BOM on File.newWriter to all the other methods that use
>>>>> writeUTF16BomIfRequired (at least until we decide we're going to
>>>>> change the current behavior)?
>>>>>
>>>>
>>>> Right, worth it!
>>>>
>>>>
>>>>>
>>>>> -Keegan
>>>>>
>>>>> On Tue, Jun 9, 2015 at 8:17 AM, Guillaume Laforge <glaforge@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Good point!
>>>>>>
>>>>>> 2015-06-09 14:11 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>
>>>>>>> That's only available in Java 7.  Isn't Groovy still targeting 1.6
>>>>>>> for the non-indy version?
>>>>>>>
>>>>>>> -Keegan
>>>>>>> On Jun 9, 2015 7:56 AM, "Guillaume Laforge" <glaforge@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Well spotted!
>>>>>>>>
>>>>>>>> You could also compare with StandardCharsets, instead of going
>>>>>>>> through the name comparison:
>>>>>>>>
>>>>>>>> http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html
>>>>>>>>
>>>>>>>> 2015-06-09 13:49 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>>>
>>>>>>>>> No, it's a Groovy bug.
>>>>>>>>>
>>>>>>>>> private static void writeUTF16BomIfRequired(final String charset, final OutputStream stream) throws IOException {
>>>>>>>>>     if ("UTF-16BE".equals(charset)) {
>>>>>>>>>         writeUtf16Bom(stream, true);
>>>>>>>>>     } else if ("UTF-16LE".equals(charset)) {
>>>>>>>>>         writeUtf16Bom(stream, false);
>>>>>>>>>     }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> should be
>>>>>>>>>
>>>>>>>>> private static void writeUTF16BomIfRequired(final String charset, final OutputStream stream) throws IOException {
>>>>>>>>>     if ("UTF-16BE".equals(Charset.forName(charset).name())) {
>>>>>>>>>         writeUtf16Bom(stream, true);
>>>>>>>>>     } else if ("UTF-16LE".equals(Charset.forName(charset).name())) {
>>>>>>>>>         writeUtf16Bom(stream, false);
>>>>>>>>>     }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> in org.codehaus.groovy.runtime.ResourceGroovyMethods.  We'll
>>>>>>>>> probably want to fix that regardless of what we decide on the
>>>>>>>>> *withPrintWriter* question.  I'll open a Jira and a PR.
>>>>>>>>>
>>>>>>>>> -Keegan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jun 9, 2015 at 3:21 AM, Guillaume Laforge <
>>>>>>>>> glaforge@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> From Groovy's point of view (i.e. when you're coding in Groovy),
>>>>>>>>>> the BOM is automatically discarded when you use one of our reader
>>>>>>>>>> methods (withReader, etc.), so it's transparent whether the BOM is
>>>>>>>>>> there or not.
>>>>>>>>>>
>>>>>>>>>> I tend to think that always having the BOM is a good thing (I
>>>>>>>>>> even thought it was mandatory), but Groovy should guess the
>>>>>>>>>> endianness regardless anyway.
>>>>>>>>>>
>>>>>>>>>> Happy to hear what others think about all this too, though.
>>>>>>>>>>
>>>>>>>>>> Guillaume
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2015-06-08 23:20 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> The code as-is today writes the BOM regardless of platform.  I
>>>>>>>>>>> just tested in Linux with the same results.  I think there are 2
>>>>>>>>>>> parts to the question of "what's the correct behavior?"
>>>>>>>>>>> the question of "what's the correct behavior?"
>>>>>>>>>>>
>>>>>>>>>>> 1.  Should the BOM be written at all, particularly when the
>>>>>>>>>>> platform is Windows?
>>>>>>>>>>> 2.  Should the behavior of *withPrintWriter* differ (even if
>>>>>>>>>>> the difference is to be smarter) from the behavior of *new
>>>>>>>>>>> PrintWriter*?
>>>>>>>>>>>
>>>>>>>>>>> *Discussion*
>>>>>>>>>>> 1.  Strictly speaking, yes, because RFC 2781
>>>>>>>>>>> <http://tools.ietf.org/html/rfc2781> states in section 4.3 to
>>>>>>>>>>> assume big endian if there is no BOM.  However, in practice, many
>>>>>>>>>>> applications disregard the RFC and assume little-endian because
>>>>>>>>>>> that's what Windows does
>>>>>>>>>>> <https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101%28v=vs.85%29.aspx>.
>>>>>>>>>>> Because of this, the behavior could be changed so that when writing
>>>>>>>>>>> UTF-16LE on Windows, it doesn't write the BOM.  But in my opinion,
>>>>>>>>>>> it's best practice to always write a BOM when working with UTF-16,
>>>>>>>>>>> and Java should have done this in its PrintWriter implementation.
>>>>>>>>>>>
>>>>>>>>>>> 2.  This is a tough one.  Arguably, *withPrintWriter* is doing
>>>>>>>>>>> the smarter, more correct thing, but the typical user would assume
>>>>>>>>>>> it is just a shorthand convenience for newing up a PrintWriter (I
>>>>>>>>>>> certainly did).  So the question is, is it better to just document
>>>>>>>>>>> this difference in the GroovyDoc?  Or to change the behavior to be
>>>>>>>>>>> closer to Java?  And if the latter, what breakages would that cause
>>>>>>>>>>> within Groovy itself?  Making that change could break folks in
>>>>>>>>>>> production, because they could rely on that BOM being there, for
>>>>>>>>>>> example where the file is created on Windows but then processed on
>>>>>>>>>>> Linux, or when working with a third-party library that is more
>>>>>>>>>>> picky about the presence of a BOM.
>>>>>>>>>>>
>>>>>>>>>>> -Keegan
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jun 8, 2015 at 4:32 PM, Guillaume Laforge <
>>>>>>>>>>> glaforge@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Now... whether that's what should be done is the real question to
>>>>>>>>>>>> ask :-)
>>>>>>>>>>>> Does Windows manage to open UTF-16 files without BOMs?
>>>>>>>>>>>>
>>>>>>>>>>>> 2015-06-08 22:17 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> I forgot to mention that.  Yes, I ran the test mentioned in
>>>>>>>>>>>>> Windows.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jun 8, 2015 at 3:54 PM, Guillaume Laforge <
>>>>>>>>>>>>> glaforge@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> That's a good question.
>>>>>>>>>>>>>> I guess this is happening on Windows? (I haven't tried here,
>>>>>>>>>>>>>> since I'm on OS X)
>>>>>>>>>>>>>> I think BOMs were mandatory in text files on Windows.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2015-06-08 17:53 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've always taken a perverse pleasure in character encoding
>>>>>>>>>>>>>>> problems.  I was intrigued by this SO question
>>>>>>>>>>>>>>> <http://stackoverflow.com/questions/30538461/why-groovy-file-write-with-utf-16le-produce-bom-char>
>>>>>>>>>>>>>>> on UTF-16 BOMs in Java vs Groovy.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It appears using withPrintWriter(charset) produces a BOM
>>>>>>>>>>>>>>> whereas new PrintWriter(file, charset) does not, as
>>>>>>>>>>>>>>> demonstrated here:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> File file = new File("tmp.txt")
>>>>>>>>>>>>>>> try {
>>>>>>>>>>>>>>>     String text = " "
>>>>>>>>>>>>>>>     String charset = "UTF-16LE"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     file.withPrintWriter(charset) { it << text }
>>>>>>>>>>>>>>>     println "withPrintWriter"
>>>>>>>>>>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     PrintWriter w = new PrintWriter(file, charset)
>>>>>>>>>>>>>>>     w.print(text)
>>>>>>>>>>>>>>>     w.close()
>>>>>>>>>>>>>>>     println "\n\nnew PrintWriter"
>>>>>>>>>>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>>>>>>>>>>> } finally {
>>>>>>>>>>>>>>>     file.delete()
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Outputs
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> withPrintWriter
>>>>>>>>>>>>>>> ff fe 20 00
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> new PrintWriter
>>>>>>>>>>>>>>> 20 00
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is this difference in behavior intentional?  It seems kinda
>>>>>>>>>>>>>>> odd to me.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Keegan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Guillaume Laforge
>>>>>>>>>>>>>> Groovy Project Manager
>>>>>>>>>>>>>> Product Ninja & Advocate at Restlet <http://restlet.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Blog: http://glaforge.appspot.com/
>>>>>>>>>>>>>> Social: @glaforge <http://twitter.com/glaforge> / Google+
>>>>>>>>>>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>

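For reference, the alias handling proposed for writeUTF16BomIfRequired in the quoted thread can be tried in isolation with something like the sketch below. This is not the ResourceGroovyMethods code. The class and method names (Utf16BomSketch, writeBomIfRequired, bomFor) are made up for illustration, and it assumes only that Charset.forName(...).name() returns the canonical name, which is what the proposed fix relies on.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;

public class Utf16BomSketch {
    // Resolve the charset name to its canonical form first, so that
    // differently cased names or aliases are treated like "UTF-16LE"/"UTF-16BE".
    static void writeBomIfRequired(String charsetName, OutputStream out) throws IOException {
        String canonical = Charset.forName(charsetName).name();
        if ("UTF-16BE".equals(canonical)) {
            out.write(0xFE);   // big-endian BOM: fe ff
            out.write(0xFF);
        } else if ("UTF-16LE".equals(canonical)) {
            out.write(0xFF);   // little-endian BOM: ff fe
            out.write(0xFE);
        }
        // Any other charset: nothing is written.
    }

    // Hypothetical convenience wrapper returning the BOM bytes (possibly empty).
    static byte[] bomFor(String charsetName) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeBomIfRequired(charsetName, out);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        for (String name : new String[] {"UTF-16LE", "utf-16le", "UTF-16BE", "UTF-8"}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : bomFor(name)) {
                hex.append(String.format("%02x ", b));
            }
            System.out.println(name + " -> " + hex.toString().trim());
        }
    }
}
```

A plain string comparison would miss "utf-16le" here, while the canonical-name lookup treats it the same as "UTF-16LE". Note that Charset.forName throws for unknown or illegal names, so a real fix would have to decide how to handle that.
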

-- 
Guillaume Laforge
Groovy Project Manager
Product Ninja & Advocate at Restlet <http://restlet.com>

Blog: http://glaforge.appspot.com/
Social: @glaforge <http://twitter.com/glaforge> / Google+
<https://plus.google.com/u/0/114130972232398734985/posts>
