groovy-users mailing list archives

From Keegan Witt <keeganw...@gmail.com>
Subject Re: UTF16 BOM in new PrintWriter() vs withPrintWriter()
Date Tue, 09 Jun 2015 14:02:10 GMT
Cool.  I'll wait for PR 36 to be merged first, because I also was thinking
the Javadoc would be changed from
    is "UTF-16BE" or "UTF-16LE"
to
    is "UTF-16BE" or "UTF-16LE" (or an equivalent alias)

-Keegan


On Tue, Jun 9, 2015 at 9:08 AM, Guillaume Laforge <glaforge@gmail.com>
wrote:

>
> 2015-06-09 15:04 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>
>> Created GROOVY-7461 <https://issues.apache.org/jira/browse/GROOVY-7461>
>> and PR 36 <https://github.com/apache/incubator-groovy/pull/36>.
>>
>
> Cool!
>
>
>> How would you feel about a PR to copy the Javadoc comment mentioning the
>> UTF-16 BOM on File.newWriter to all the other methods that use
>> writeUTF16BomIfRequired (at least until we decide we're going to change
>> the current behavior)?
>>
>
> Right, worth it!
>
>
>>
>> -Keegan
>>
>> On Tue, Jun 9, 2015 at 8:17 AM, Guillaume Laforge <glaforge@gmail.com>
>> wrote:
>>
>>> Good point!
>>>
>>> 2015-06-09 14:11 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>
>>>> That's only available in Java 7.  Isn't Groovy still targeting 1.6 for
>>>> the non-indy version?
>>>>
>>>> -Keegan
>>>> On Jun 9, 2015 7:56 AM, "Guillaume Laforge" <glaforge@gmail.com> wrote:
>>>>
>>>>> Well spotted!
>>>>>
>>>>> You could also compare with StandardCharsets, instead of going
>>>>> through the name comparison:
>>>>>
>>>>> http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html
>>>>>
>>>>> 2015-06-09 13:49 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>
>>>>>> No, it's a Groovy bug.
>>>>>>
>>>>>> private static void writeUTF16BomIfRequired(final String charset, final OutputStream stream) throws IOException {
>>>>>>     if ("UTF-16BE".equals(charset)) {
>>>>>>         writeUtf16Bom(stream, true);
>>>>>>     } else if ("UTF-16LE".equals(charset)) {
>>>>>>         writeUtf16Bom(stream, false);
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> should be
>>>>>>
>>>>>> private static void writeUTF16BomIfRequired(final String charset, final OutputStream stream) throws IOException {
>>>>>>     if ("UTF-16BE".equals(Charset.forName(charset).name())) {
>>>>>>         writeUtf16Bom(stream, true);
>>>>>>     } else if ("UTF-16LE".equals(Charset.forName(charset).name())) {
>>>>>>         writeUtf16Bom(stream, false);
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> in org.codehaus.groovy.runtime.ResourceGroovyMethods.  We'll probably
>>>>>> want to fix that regardless of what we decide on the
>>>>>> *withPrintWriter* question.  I'll open a Jira and a PR.
>>>>>>
>>>>>> -Keegan
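
To illustrate why the literal string comparison misses aliases, here is a minimal standalone sketch. The class name CharsetAliasSketch and the helper bomFor are hypothetical and are not part of ResourceGroovyMethods; the sketch only relies on the standard behavior that Charset.forName accepts a registered name in any letter case and that name() returns the canonical name.

```java
import java.nio.charset.Charset;

// Hypothetical standalone sketch; this is not the actual Groovy source.
class CharsetAliasSketch {

    // Returns the UTF-16 BOM bytes matching the given charset name,
    // or an empty array when the charset is neither UTF-16BE nor UTF-16LE.
    static byte[] bomFor(String charsetName) {
        // Resolve aliases and letter case to the canonical name first.
        String canonical = Charset.forName(charsetName).name();
        if ("UTF-16BE".equals(canonical)) {
            return new byte[] {(byte) 0xFE, (byte) 0xFF};
        } else if ("UTF-16LE".equals(canonical)) {
            return new byte[] {(byte) 0xFF, (byte) 0xFE};
        }
        return new byte[0];
    }

    public static void main(String[] args) {
        // A plain string comparison does not match a lower-case spelling.
        System.out.println("UTF-16LE".equals("utf-16le"));
        // The canonical name resolved by Charset.forName does.
        System.out.println(Charset.forName("utf-16le").name());
        System.out.println(bomFor("utf-16le").length);
    }
}
```

Resolving the name once and comparing the canonical form keeps the check independent of how the caller happened to spell the charset.
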
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 9, 2015 at 3:21 AM, Guillaume Laforge <glaforge@gmail.com> wrote:
>>>>>>
>>>>>>> From Groovy's point of view (i.e. when you're coding in Groovy), the
>>>>>>> BOM is automatically discarded when you use one of our reader methods
>>>>>>> (withReader, etc.), so it's transparent whether the BOM is there or not.
>>>>>>>
>>>>>>> I tend to think that always having the BOM is a good thing (I even
>>>>>>> thought that was mandatory), but Groovy should guess the endianness
>>>>>>> regardless.
>>>>>>>
>>>>>>> Happy to hear what others think too about all this though.
>>>>>>>
>>>>>>> Guillaume
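
As a point of comparison on the reading side, the JDK's own decoders treat a leading BOM differently depending on the charset: the endian-specific UTF-16LE decoder keeps it as the character U+FEFF, while the generic UTF-16 decoder uses it to choose the byte order and drops it. A minimal sketch in plain Java follows; the class name BomDecodeSketch is hypothetical, and the sketch does not exercise Groovy's reader methods.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of how the JDK decoders handle a leading BOM.
class BomDecodeSketch {

    // The bytes reported for withPrintWriter in this thread:
    // little-endian BOM (ff fe) followed by a space (20 00).
    static final byte[] WITH_BOM = {(byte) 0xFF, (byte) 0xFE, 0x20, 0x00};

    // Endian-specific decoder: the BOM survives as U+FEFF.
    static String decodeLittleEndian(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16LE);
    }

    // Generic decoder: the BOM selects the byte order and is consumed.
    static String decodeGeneric(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        System.out.println(decodeLittleEndian(WITH_BOM).length());
        System.out.println(decodeGeneric(WITH_BOM).length());
    }
}
```
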
>>>>>>>
>>>>>>>
>>>>>>> 2015-06-08 23:20 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>>
>>>>>>>> The code as-is today writes the BOM regardless of platform.  I just
>>>>>>>> tested in Linux with the same results.  I think there are 2 parts to the
>>>>>>>> question of "what's the correct behavior?"
>>>>>>>>
>>>>>>>> 1.  Should the BOM be written at all, particularly when the
>>>>>>>> platform is Windows?
>>>>>>>> 2.  Should the behavior of *withPrintWriter* differ (even if the
>>>>>>>> difference is to be smarter) from the behavior of *new PrintWriter*?
>>>>>>>>
>>>>>>>> *Discussion*
>>>>>>>> 1.  Strictly speaking, yes, because RFC 2781
>>>>>>>> <http://tools.ietf.org/html/rfc2781> states in section 4.3 to
>>>>>>>> assume big-endian if there is no BOM.  However, in practice, many
>>>>>>>> applications disregard the RFC and assume little-endian because
>>>>>>>> that's what Windows does
>>>>>>>> <https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101%28v=vs.85%29.aspx>.
>>>>>>>> Because of this, the behavior could be changed so that when writing
>>>>>>>> UTF-16LE on Windows, it doesn't write the BOM.  But in my opinion, it's
>>>>>>>> best practice to always write a BOM when working with UTF-16, and Java
>>>>>>>> should have done this in its implementation of PrintWriter.
>>>>>>>>
>>>>>>>> 2.  This is a tough one.  Arguably, *withPrintWriter* is doing the
>>>>>>>> smarter, more correct thing, but the typical user would assume this is
>>>>>>>> just a shorthand convenience for newing up a PrintWriter (I certainly
>>>>>>>> did).  So the question is: is it better to just document this difference in
>>>>>>>> the GroovyDoc, or to change the behavior to be closer to Java?  And if the
>>>>>>>> latter, what breakages would that cause within Groovy itself?  Making that
>>>>>>>> change could break folks in production, because they could rely on that BOM
>>>>>>>> being there, for example in cases where the file is created on Windows but
>>>>>>>> then processed on Linux, or when working with a third-party library that is
>>>>>>>> more picky about the presence of a BOM.
>>>>>>>>
>>>>>>>> -Keegan
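
The byte-level difference discussed here can be reproduced with the JDK alone. The following is a small sketch (the class name BomEncodeSketch and its helpers are hypothetical) that relies on the documented behavior of the standard charsets: UTF-16LE writes no BOM when encoding, the generic UTF-16 charset writes a big-endian BOM, and decoding BOM-less bytes as UTF-16 defaults to big-endian.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch comparing the bytes produced by the JDK's UTF-16 charsets.
class BomEncodeSketch {

    // Encodes the text with the given charset and renders the bytes as hex.
    static String encodeHex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02x ", b));
        }
        return sb.toString().trim();
    }

    // Decodes bytes with the generic UTF-16 charset, which assumes
    // big-endian byte order when no BOM is present.
    static String decodeGeneric(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // No BOM: just the little-endian code unit for a space.
        System.out.println(encodeHex(" ", StandardCharsets.UTF_16LE));
        // Big-endian BOM followed by the big-endian code unit.
        System.out.println(encodeHex(" ", StandardCharsets.UTF_16));
        // BOM-less little-endian bytes read as big-endian give U+2000, not a space.
        System.out.println((int) decodeGeneric(new byte[] {0x20, 0x00}).charAt(0));
    }
}
```

The last line shows the practical risk of omitting the BOM: a reader that follows the big-endian default misinterprets little-endian content.
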
>>>>>>>>
>>>>>>>> On Mon, Jun 8, 2015 at 4:32 PM, Guillaume Laforge <glaforge@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Now... whether that's what should be done or not is the good question to
>>>>>>>>> ask :-)
>>>>>>>>> Does Windows manage to open UTF-16 files without BOMs?
>>>>>>>>>
>>>>>>>>> 2015-06-08 22:17 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>>>>
>>>>>>>>>> I forgot to mention that.  Yes, I ran the test mentioned in
>>>>>>>>>> Windows.
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 8, 2015 at 3:54 PM, Guillaume Laforge <glaforge@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> That's a good question.
>>>>>>>>>>> I guess this is happening on Windows? (I haven't tried here,
>>>>>>>>>>> since I'm on OS X)
>>>>>>>>>>> I think BOMs were mandatory in text files on Windows.
>>>>>>>>>>>
>>>>>>>>>>> 2015-06-08 17:53 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>>>>>>
>>>>>>>>>>>> I've always taken a perverse pleasure in character encoding
>>>>>>>>>>>> problems.  I was intrigued by this SO question
>>>>>>>>>>>> <http://stackoverflow.com/questions/30538461/why-groovy-file-write-with-utf-16le-produce-bom-char>
>>>>>>>>>>>> on UTF-16 BOMs in Java vs Groovy.
>>>>>>>>>>>>
>>>>>>>>>>>> It appears using withPrintWriter(charset) produces a BOM
>>>>>>>>>>>> whereas new PrintWriter(file, charset) does not, as
>>>>>>>>>>>> demonstrated here:
>>>>>>>>>>>>
>>>>>>>>>>>> File file = new File("tmp.txt")
>>>>>>>>>>>> try {
>>>>>>>>>>>>     String text = " "
>>>>>>>>>>>>     String charset = "UTF-16LE"
>>>>>>>>>>>>
>>>>>>>>>>>>     file.withPrintWriter(charset) { it << text }
>>>>>>>>>>>>     println "withPrintWriter"
>>>>>>>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>>>>>>>>
>>>>>>>>>>>>     PrintWriter w = new PrintWriter(file, charset)
>>>>>>>>>>>>     w.print(text)
>>>>>>>>>>>>     w.close()
>>>>>>>>>>>>     println "\n\nnew PrintWriter"
>>>>>>>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>>>>>>>> } finally {
>>>>>>>>>>>>     file.delete()
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> Outputs
>>>>>>>>>>>>
>>>>>>>>>>>> withPrintWriter
>>>>>>>>>>>> ff fe 20 00
>>>>>>>>>>>>
>>>>>>>>>>>> new PrintWriter
>>>>>>>>>>>> 20 00
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Is this difference in behavior intentional?  It seems kinda odd
>>>>>>>>>>>> to me.
>>>>>>>>>>>>
>>>>>>>>>>>> -Keegan
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Guillaume Laforge
>>>>>>>>>>> Groovy Project Manager
>>>>>>>>>>> Product Ninja & Advocate at Restlet <http://restlet.com>
>>>>>>>>>>>
>>>>>>>>>>> Blog: http://glaforge.appspot.com/
>>>>>>>>>>> Social: @glaforge <http://twitter.com/glaforge> / Google+
>>>>>>>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>
