2015-06-09 15:04 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
Created GROOVY-7461 and PR 36.

Cool!
 
How would you feel about a PR to copy the Javadoc comment mentioning the UTF-16 BOM on File.newWriter to all the other methods that use writeUTF16BomIfRequired (at least until we decide we're going to change the current behavior)?

Right, worth it!
 

-Keegan

On Tue, Jun 9, 2015 at 8:17 AM, Guillaume Laforge <glaforge@gmail.com> wrote:
Good point!

2015-06-09 14:11 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:

That's only available in Java 7.  Isn't Groovy still targeting 1.6 for the non-indy version?

-Keegan

On Jun 9, 2015 7:56 AM, "Guillaume Laforge" <glaforge@gmail.com> wrote:
Well spotted!

You could also compare against StandardCharsets, instead of going through the name comparison:
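A minimal sketch of what such a comparison might look like. This is not the snippet from the original message; the class and method names are hypothetical, and StandardCharsets requires Java 7 or later. It relies on Charset.equals, which compares canonical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class StandardCharsetCompare {
    // Resolve the caller-supplied name to a Charset, then compare Charset objects
    // instead of comparing strings.
    static boolean isUtf16Le(String charsetName) {
        return StandardCharsets.UTF_16LE.equals(Charset.forName(charsetName));
    }

    static boolean isUtf16Be(String charsetName) {
        return StandardCharsets.UTF_16BE.equals(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Charset names are case-insensitive, so both spellings resolve to the same charset.
        System.out.println(isUtf16Le("UTF-16LE"));
        System.out.println(isUtf16Le("utf-16le"));
        System.out.println(isUtf16Be("utf-16le"));
    }
}
```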

2015-06-09 13:49 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
No, it's a Groovy bug.
    private static void writeUTF16BomIfRequired(final String charset, final OutputStream stream) throws IOException {
        if ("UTF-16BE".equals(charset)) {
            writeUtf16Bom(stream, true);
        } else if ("UTF-16LE".equals(charset)) {
            writeUtf16Bom(stream, false);
        }
    }
should be
    private static void writeUTF16BomIfRequired(final String charset, final OutputStream stream) throws IOException {
        if ("UTF-16BE".equals(Charset.forName(charset).name())) {
            writeUtf16Bom(stream, true);
        } else if ("UTF-16LE".equals(Charset.forName(charset).name())) {
            writeUtf16Bom(stream, false);
        }
    }
in org.codehaus.groovy.runtime.ResourceGroovyMethods.  We'll probably want to fix that regardless of what we decide on the withPrintWriter question.  I'll open a Jira and a PR.
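To illustrate why the name lookup matters, here is a self-contained sketch comparing the two checks. It assumes the problem is that the raw string match misses other spellings of the same charset (for example a lowercase name), which Charset.forName(...).name() resolves to the canonical name. The class, the two check methods, and the writeUtf16Bom stand-in are hypothetical and are not the Groovy sources.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;

final class BomAliasDemo {
    // Hypothetical stand-in for Groovy's private helper: writes U+FEFF in the requested byte order.
    static void writeUtf16Bom(OutputStream stream, boolean bigEndian) throws IOException {
        stream.write(bigEndian ? 0xFE : 0xFF);
        stream.write(bigEndian ? 0xFF : 0xFE);
    }

    // Check as quoted first in the message: exact match on the caller-supplied string.
    static void bomByRawName(String charset, OutputStream stream) throws IOException {
        if ("UTF-16BE".equals(charset)) {
            writeUtf16Bom(stream, true);
        } else if ("UTF-16LE".equals(charset)) {
            writeUtf16Bom(stream, false);
        }
    }

    // Proposed check: resolve the name to its canonical form once, then compare.
    static void bomByCanonicalName(String charset, OutputStream stream) throws IOException {
        String canonical = Charset.forName(charset).name();
        if ("UTF-16BE".equals(canonical)) {
            writeUtf16Bom(stream, true);
        } else if ("UTF-16LE".equals(canonical)) {
            writeUtf16Bom(stream, false);
        }
    }

    // Number of BOM bytes each variant emits for a given charset name.
    static int rawBomLength(String charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            bomByRawName(charset, out);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return out.size();
    }

    static int canonicalBomLength(String charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            bomByCanonicalName(charset, out);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return out.size();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UTF-16LE", "utf-16le"}) {
            System.out.println(name + ": raw=" + rawBomLength(name)
                    + " canonical=" + canonicalBomLength(name));
        }
    }
}
```

With the raw match, "utf-16le" gets no BOM even though it names the same charset as "UTF-16LE"; with the canonical name, both spellings get the two BOM bytes.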

-Keegan



On Tue, Jun 9, 2015 at 3:21 AM, Guillaume Laforge <glaforge@gmail.com> wrote:
From Groovy's point of view (i.e. when you're coding in Groovy), the BOM is automatically discarded when you use one of our reader methods (withReader, etc.), so it's transparent whether the BOM is present or not.

I tend to think that having the BOM always is a good thing (I even thought that was mandatory), but Groovy should guess the endianness regardless anyway.

Happy to hear what others think too about all this though.

Guillaume


2015-06-08 23:20 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
The code as-is today writes the BOM regardless of platform.  I just tested in Linux with the same results.  I think there are 2 parts to the question of "what's the correct behavior?"

1.  Should the BOM be written at all, particularly when the platform is Windows?
2.  Should the behavior of withPrintWriter differ (even if the difference is to be smarter) from the behavior of new PrintWriter?

Discussion
1.  Strictly speaking, yes, because RFC 2781 section 4.3 says to assume big-endian if there is no BOM.  However, in practice, many applications disregard the RFC and assume little-endian because that's what Windows does.  Because of this, the behavior could be changed so that when writing UTF-16LE on Windows, it doesn't write the BOM.  But in my opinion, it's best practice to always write a BOM when working with UTF-16, and Java should have done this in its PrintWriter implementation.

2.  This is a tough one.  Arguably, withPrintWriter is doing the smarter, more correct thing, but the typical user would assume this is just a shorthand convenience for newing up a PrintWriter (I certainly did).  So the question is, is it better to just document this difference in the GroovyDoc?  Or to change the behavior to be closer to Java?  And if the latter, what breakages would that cause within Groovy itself?  Making that change could break folks in production, because they could be relying on that BOM being there, for example when the file is created on Windows but then processed on Linux, or when working with a third-party library that is pickier about the presence of a BOM.
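On point 1, a small sketch of what the big-endian default means for a reader, using Java's own "UTF-16" decoder (which honors a BOM and otherwise assumes big-endian). The class and method names are hypothetical; the byte sequences are the ones from the test output in this thread.

```java
import java.nio.charset.StandardCharsets;

final class Utf16DefaultEndianness {
    // Decode with the generic "UTF-16" charset: a leading BOM selects the byte order
    // and is consumed; without a BOM the decoder assumes big-endian.
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xFF, (byte) 0xFE, 0x20, 0x00}; // withPrintWriter output
        byte[] withoutBom = {0x20, 0x00};                        // new PrintWriter output

        // With the BOM, the little-endian bytes decode back to a single space (U+0020).
        System.out.printf("with BOM:    U+%04X%n", (int) decodeUtf16(withBom).charAt(0));
        // Without it, the same payload is read big-endian and becomes U+2000.
        System.out.printf("without BOM: U+%04X%n", (int) decodeUtf16(withoutBom).charAt(0));
    }
}
```

A reader that is told "UTF-16LE" explicitly decodes the BOM-less file correctly; the mismatch only appears when the reader has to infer the byte order.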

-Keegan

On Mon, Jun 8, 2015 at 4:32 PM, Guillaume Laforge <glaforge@gmail.com> wrote:
Now... whether that's what should be done or not is the real question to ask :-)
Does Windows manage to open UTF-16 files without BOMs?

2015-06-08 22:17 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
I forgot to mention that.  Yes, I ran the test mentioned in Windows.

On Mon, Jun 8, 2015 at 3:54 PM, Guillaume Laforge <glaforge@gmail.com> wrote:
That's a good question.
I guess this is happening on Windows? (I haven't tried here, since I'm on OS X)
I think BOMs were mandatory in text files on Windows.

2015-06-08 17:53 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
I've always taken a perverse pleasure in character encoding problems.  I was intrigued by this SO question on UTF-16 BOMs in Java vs Groovy.

It appears using withPrintWriter(charset) produces a BOM whereas new PrintWriter(file, charset) does not.  As demonstrated here:

File file = new File("tmp.txt")
try {
    String text = " "
    String charset = "UTF-16LE"

    file.withPrintWriter(charset) { it << text }
    println "withPrintWriter"
    file.getBytes().each { System.out.format("%02x ", it) }

    PrintWriter w = new PrintWriter(file, charset)
    w.print(text)
    w.close()
    println "\n\nnew PrintWriter"
    file.getBytes().each { System.out.format("%02x ", it) }
} finally {
    file.delete()
}

Outputs

withPrintWriter
ff fe 20 00 

new PrintWriter
20 00

Is this difference in behavior intentional?  It seems kinda odd to me.
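For reference, a short sketch of how Java's own encoders treat the BOM, which is consistent with the new PrintWriter output above: the generic "UTF-16" charset writes a big-endian BOM when encoding, while the explicit "UTF-16LE" and "UTF-16BE" charsets write none. The class and helper names are hypothetical.

```java
import java.nio.charset.Charset;

final class JavaEncoderBomDemo {
    // Encode text with the named charset and render the bytes as space-separated hex.
    static String hex(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UTF-16", "UTF-16LE", "UTF-16BE"}) {
            System.out.println(name + ": " + hex(" ", name));
        }
    }
}
```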

-Keegan



--
Guillaume Laforge
Groovy Project Manager
Product Ninja & Advocate at Restlet




