groovy-users mailing list archives

From Guillaume Laforge <glafo...@gmail.com>
Subject Re: UTF16 BOM in new PrintWriter() vs withPrintWriter()
Date Tue, 09 Jun 2015 13:08:05 GMT
2015-06-09 15:04 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:

> Created GROOVY-7461 <https://issues.apache.org/jira/browse/GROOVY-7461>
> and PR 36 <https://github.com/apache/incubator-groovy/pull/36>.
>

Cool!


> How would you feel about a PR to copy the Javadoc comment mentioning the
> UTF-16 BOM on File.newWriter to all the other methods that use
> writeUTF16BomIfRequired (at least until we decide we're going to change
> the current behavior)?
>

Right, worth it!


>
> -Keegan
>
> On Tue, Jun 9, 2015 at 8:17 AM, Guillaume Laforge <glaforge@gmail.com>
> wrote:
>
>> Good point!
>>
>> 2015-06-09 14:11 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>
>>> That's only available in Java 7.  Isn't Groovy still targeting 1.6 for
>>> the non-indy version?
>>>
>>> -Keegan
>>> On Jun 9, 2015 7:56 AM, "Guillaume Laforge" <glaforge@gmail.com> wrote:
>>>
>>>> Well spotted!
>>>>
>>>> You could also compare with the StandardCharset, instead of going
>>>> through the name comparison:
>>>>
>>>> http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html
>>>>
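The suggestion above, comparing Charset objects instead of names, might look roughly like the following. This is a minimal sketch, not the code from the PR: the class and method names (`BomCharsetCheck`, `isExplicitUtf16`) are hypothetical, and `StandardCharsets` requires Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomCharsetCheck {
    // Hypothetical helper: true when the requested charset resolves to one of
    // the two explicit-endianness UTF-16 variants discussed in this thread.
    static boolean isExplicitUtf16(String charsetName) {
        // Charset.forName accepts aliases and ignores case.
        Charset cs = Charset.forName(charsetName);
        return StandardCharsets.UTF_16BE.equals(cs)
                || StandardCharsets.UTF_16LE.equals(cs);
    }

    public static void main(String[] args) {
        System.out.println(isExplicitUtf16("utf-16le")); // true
        System.out.println(isExplicitUtf16("UTF-8"));    // false
    }
}
```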
>>>> 2015-06-09 13:49 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>
>>>>> No, it's a Groovy bug.
>>>>>
>>>>> private static void writeUTF16BomIfRequired(final String charset, final OutputStream stream) throws IOException {
>>>>>     if ("UTF-16BE".equals(charset)) {
>>>>>         writeUtf16Bom(stream, true);
>>>>>     } else if ("UTF-16LE".equals(charset)) {
>>>>>         writeUtf16Bom(stream, false);
>>>>>     }
>>>>> }
>>>>>
>>>>> should be
>>>>>
>>>>> private static void writeUTF16BomIfRequired(final String charset, final OutputStream stream) throws IOException {
>>>>>     if ("UTF-16BE".equals(Charset.forName(charset).name())) {
>>>>>         writeUtf16Bom(stream, true);
>>>>>     } else if ("UTF-16LE".equals(Charset.forName(charset).name())) {
>>>>>         writeUtf16Bom(stream, false);
>>>>>     }
>>>>> }
>>>>>
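One reading of why the second version is more robust: a plain string comparison only matches the exact canonical spelling, whereas `Charset.forName(...).name()` normalizes case and aliases first. The sketch below illustrates just that JDK behavior; the class and method names are hypothetical and nothing here is taken from Groovy's sources.

```java
import java.nio.charset.Charset;

class CharsetNameDemo {
    // Returns the canonical name the JVM registers for a charset name or alias.
    static String canonicalName(String charsetName) {
        return Charset.forName(charsetName).name();
    }

    public static void main(String[] args) {
        String requested = "utf-16le"; // lowercase spelling, accepted by the JDK
        System.out.println("UTF-16LE".equals(requested));                // false
        System.out.println("UTF-16LE".equals(canonicalName(requested))); // true
    }
}
```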
>>>>> in org.codehaus.groovy.runtime.ResourceGroovyMethods.  We'll probably
>>>>> want to fix that regardless of what we decide on the *withPrintWriter*
>>>>> question.  I'll open a Jira and a PR.
>>>>>
>>>>> -Keegan
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jun 9, 2015 at 3:21 AM, Guillaume Laforge <glaforge@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> From Groovy's point of view (i.e. when you're coding in Groovy), the
>>>>>> BOM is automatically discarded when you use one of our reader methods
>>>>>> (withReader, etc.), so it's transparent whether the BOM is present or
>>>>>> not.
>>>>>>
>>>>>> I tend to think that having the BOM always is a good thing (I even
>>>>>> thought that was mandatory), but Groovy should guess the endianness
>>>>>> regardless anyway.
>>>>>>
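For contrast, a plain JDK decode with an explicit-endianness charset does not discard the BOM: it comes through as a leading U+FEFF character. The sketch below shows that and one simple way to drop it. It is only an illustration, not how Groovy's reader methods are implemented, and `stripLeadingBom` is a hypothetical helper.

```java
import java.nio.charset.StandardCharsets;

class StripBomDemo {
    // Hypothetical helper: drops one leading U+FEFF, if present.
    static String stripLeadingBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == '\uFEFF') ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        // The bytes written by withPrintWriter in the example: LE BOM + space.
        byte[] fileBytes = {(byte) 0xFF, (byte) 0xFE, 0x20, 0x00};
        String raw = new String(fileBytes, StandardCharsets.UTF_16LE);
        System.out.println(raw.length());                  // 2 (BOM kept as U+FEFF)
        System.out.println(stripLeadingBom(raw).length()); // 1
    }
}
```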
>>>>>> Happy to hear what others think too about all this though.
>>>>>>
>>>>>> Guillaume
>>>>>>
>>>>>>
>>>>>> 2015-06-08 23:20 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>
>>>>>>> The code as-is today writes the BOM regardless of platform.  I just
>>>>>>> tested in Linux with the same results.  I think there are 2 parts to the
>>>>>>> question of "what's the correct behavior?"
>>>>>>>
>>>>>>> 1.  Should the BOM be written at all, particularly when the platform
>>>>>>> is Windows?
>>>>>>> 2.  Should the behavior of *withPrintWriter* differ (even if the
>>>>>>> difference is to be smarter) from the behavior of *new PrintWriter*?
>>>>>>>
>>>>>>> *Discussion*
>>>>>>> 1.  Strictly speaking, yes, because RFC 2781
>>>>>>> <http://tools.ietf.org/html/rfc2781> says in section 4.3 to
>>>>>>> assume big-endian if there is no BOM.  However, in practice, many
>>>>>>> applications disregard the RFC and assume little-endian because
>>>>>>> that's what Windows does
>>>>>>> <https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101%28v=vs.85%29.aspx>.
>>>>>>> Because of this, the behavior could be changed so that when writing
>>>>>>> UTF-16LE on Windows, it doesn't write the BOM.  But in my opinion, it's
>>>>>>> best practice to always write a BOM when working with UTF-16, and Java
>>>>>>> should have done this in their implementation of their PrintWriter.
>>>>>>>
>>>>>>> 2.  This is a tough one.  Arguably, *withPrintWriter* is doing the
>>>>>>> smarter, more correct behavior, but the typical user would assume this is
>>>>>>> just a shorthand convenience for newing up a PrintWriter (I certainly
>>>>>>> did).  So the question is, is it better to just document this difference in
>>>>>>> the GroovyDoc?  Or to change the behavior to be closer to Java?  And if the
>>>>>>> latter, what breakages would that cause within Groovy itself?  Making that
>>>>>>> change could break folks in production, because they could rely on that BOM
>>>>>>> being there, in cases for example where the file is created on Windows, but
>>>>>>> then processed on Linux or when working with a third party library that is
>>>>>>> more picky about the presence of a BOM.
>>>>>>>
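The big-endian default from point 1 can be observed with the JDK's generic "UTF-16" charset, which honors a leading BOM and otherwise assumes big-endian. The sketch below is only an illustration of that decoding rule with the bytes from the original example; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Utf16DefaultOrderDemo {
    // Decodes with the generic "UTF-16" charset: a leading BOM selects the
    // byte order and is consumed; with no BOM, big-endian is assumed.
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xFF, (byte) 0xFE, 0x20, 0x00}; // LE BOM + space
        byte[] withoutBom = {0x20, 0x00};                        // same space, no BOM

        System.out.printf("U+%04X%n", (int) decodeUtf16(withBom).charAt(0));    // U+0020
        System.out.printf("U+%04X%n", (int) decodeUtf16(withoutBom).charAt(0)); // U+2000
    }
}
```

Without the BOM, the little-endian space is misread as U+2000, which is the kind of breakage a consumer that follows the RFC default would see.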
>>>>>>> -Keegan
>>>>>>>
>>>>>>> On Mon, Jun 8, 2015 at 4:32 PM, Guillaume Laforge <
>>>>>>> glaforge@gmail.com> wrote:
>>>>>>>
>>>>>>>> Now... whether that is what should be done is the good question
>>>>>>>> to ask :-)
>>>>>>>> Does Windows manage to open UTF-16 files without BOMs?
>>>>>>>>
>>>>>>>> 2015-06-08 22:17 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>>>
>>>>>>>>> I forgot to mention that.  Yes, I ran the test mentioned in
>>>>>>>>> Windows.
>>>>>>>>>
>>>>>>>>> On Mon, Jun 8, 2015 at 3:54 PM, Guillaume Laforge <
>>>>>>>>> glaforge@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> That's a good question.
>>>>>>>>>> I guess this is happening on Windows? (I haven't tried here,
>>>>>>>>>> since I'm on OS X)
>>>>>>>>>> I think BOMs were mandatory in text files on Windows.
>>>>>>>>>>
>>>>>>>>>> 2015-06-08 17:53 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> I've always taken a perverse pleasure in character encoding
>>>>>>>>>>> problems.  I was intrigued by this SO question
>>>>>>>>>>> <http://stackoverflow.com/questions/30538461/why-groovy-file-write-with-utf-16le-produce-bom-char>
>>>>>>>>>>> on UTF 16 BOMs in Java vs Groovy.
>>>>>>>>>>>
>>>>>>>>>>> It appears using withPrintWriter(charset) produces a BOM
>>>>>>>>>>> whereas new PrintWriter(file, charset) does not.  As
>>>>>>>>>>> demonstrated here:
>>>>>>>>>>>
>>>>>>>>>>> File file = new File("tmp.txt")
>>>>>>>>>>> try {
>>>>>>>>>>>     String text = " "
>>>>>>>>>>>     String charset = "UTF-16LE"
>>>>>>>>>>>
>>>>>>>>>>>     file.withPrintWriter(charset) { it << text }
>>>>>>>>>>>     println "withPrintWriter"
>>>>>>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>>>>>>>
>>>>>>>>>>>     PrintWriter w = new PrintWriter(file, charset)
>>>>>>>>>>>     w.print(text)
>>>>>>>>>>>     w.close()
>>>>>>>>>>>     println "\n\nnew PrintWriter"
>>>>>>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>>>>>>> } finally {
>>>>>>>>>>>     file.delete()
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> Outputs
>>>>>>>>>>>
>>>>>>>>>>> withPrintWriter
>>>>>>>>>>> ff fe 20 00
>>>>>>>>>>>
>>>>>>>>>>> new PrintWriter
>>>>>>>>>>> 20 00
>>>>>>>>>>>
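The "new PrintWriter" half of this output matches how the JDK's own encoders behave: the explicit-endianness charsets emit no BOM, while the generic "UTF-16" charset emits a big-endian one. A minimal Java sketch of that, independent of Groovy, with hypothetical class and helper names:

```java
import java.nio.charset.StandardCharsets;

class EncoderBomDemo {
    // Formats bytes the same way as the Groovy example: two hex digits each.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex(" ".getBytes(StandardCharsets.UTF_16LE))); // 20 00
        System.out.println(hex(" ".getBytes(StandardCharsets.UTF_16BE))); // 00 20
        System.out.println(hex(" ".getBytes(StandardCharsets.UTF_16)));   // fe ff 00 20
    }
}
```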
>>>>>>>>>>>
>>>>>>>>>>> Is this difference in behavior intentional?  It seems kinda odd
>>>>>>>>>>> to me.
>>>>>>>>>>>
>>>>>>>>>>> -Keegan
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Guillaume Laforge
>>>>>>>>>> Groovy Project Manager
>>>>>>>>>> Product Ninja & Advocate at Restlet <http://restlet.com>
>>>>>>>>>>
>>>>>>>>>> Blog: http://glaforge.appspot.com/
>>>>>>>>>> Social: @glaforge <http://twitter.com/glaforge> / Google+
>>>>>>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>
>


-- 
Guillaume Laforge
Groovy Project Manager
Product Ninja & Advocate at Restlet <http://restlet.com>

Blog: http://glaforge.appspot.com/
Social: @glaforge <http://twitter.com/glaforge> / Google+
<https://plus.google.com/u/0/114130972232398734985/posts>
