Files and Resources do not handle UTF-8 files with BOM #345

gissuebot · 2014-10-31T17:08:33Z

Original issue created by kai@google.com on 2010-04-08 at 07:59 PM

By the UTF-8 definition, UTF-8 files are allowed to have an optional leading
BOM. This BOM is stupid and pointless, but many Windows apps seem to
generate UTF-8 files with the BOM. Guava's classes Files and Resources do
not handle UTF-8 files with a BOM. I'm not sure where this fix belongs, or
whether it should even be fixed at all (since Windows is being stupid, and
people are rightly sick and tired of working around Windows issues). BTW, I
don't personally use Windows. I'm reporting this issue only because I
maintain a library that uses Guava, and there are some Windows users of my
library that are running into this issue.

gissuebot · 2014-10-31T18:22:59Z

Original comment posted by kevinb@google.com on 2010-04-09 at 07:05 PM

Good to know. Based on this, it will probably make sense for us to check for a byte-order
mark and just advance past it. I assume JDK classes like Reader do this?
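
For illustration, checking for a UTF-8 byte-order mark (the bytes EF BB BF) and advancing past it could look roughly like the sketch below. It uses only JDK classes; `Utf8BomSkipper` and `openUtf8` are hypothetical names, not existing Guava or JDK API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of Guava or the JDK. */
final class Utf8BomSkipper {
  private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

  /** Returns a UTF-8 reader positioned after a leading BOM, if one is present. */
  static Reader openUtf8(InputStream in) throws IOException {
    PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
    byte[] head = new byte[UTF8_BOM.length];
    int n = 0;
    // read() may return fewer bytes than requested, so loop until 3 bytes or EOF.
    while (n < head.length) {
      int r = pushback.read(head, n, head.length - n);
      if (r < 0) {
        break;
      }
      n += r;
    }
    boolean hasBom = n == UTF8_BOM.length;
    for (int i = 0; hasBom && i < UTF8_BOM.length; i++) {
      hasBom = (head[i] & 0xFF) == UTF8_BOM[i];
    }
    // No BOM: push the bytes back so the decoder sees the stream from the start.
    if (!hasBom && n > 0) {
      pushback.unread(head, 0, n);
    }
    return new InputStreamReader(pushback, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
    try (Reader r = openUtf8(new ByteArrayInputStream(data))) {
      System.out.println((char) r.read()); // prints "h"
    }
  }
}
```

Working at the byte level keeps the check independent of the decoder, but this sketch only handles UTF-8; other encodings would need their own BOM patterns.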

gissuebot · 2014-10-31T18:56:47Z

Original comment posted by fry@google.com on 2011-01-28 at 04:03 PM

(No comment entered for this change.)

Status: Accepted
Labels: Type-Defect

gissuebot · 2014-10-31T19:00:12Z

Original comment posted by mail4danny on 2011-02-04 at 04:18 PM

No, the JDK just quietly ignores this ;-)

gissuebot · 2014-10-31T19:00:15Z

Original comment posted by finnw1 on 2011-02-04 at 09:01 PM

The JDK does detect (and strip) the BOM for some encodings, e.g.
Standard encodings:
UTF-16
UTF-32
Non-standard encodings (that are reported by Charset.availableCharsets()) on my system:
x-UTF-16LE-BOM
X-UTF-32BE-BOM
X-UTF-32LE-BOM
It's for those that are not expected to contain a BOM that the BOM is returned to the application.
It sounds as though what you want is a standard encoding based on UTF-8 that accepts a BOM, e.g.
UTF-8-BOM
But (1) this is a feature request not a defect and
(2) it belongs in the JDK not Guava
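
A small sketch of the difference described above, using only the standard `UTF-8` and `UTF-16` charsets (the non-standard `x-`/`X-` charsets are left out because they may not be present on every JVM). `BomDecodeDemo` and `decode` are names made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustrative only: shows which standard decoders consume a leading BOM. */
final class BomDecodeDemo {
  static String decode(byte[] bytes, Charset charset) {
    return new String(bytes, charset);
  }

  public static void main(String[] args) {
    // U+FEFF encoded as UTF-8 (EF BB BF), followed by "A".
    byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'A'};
    // Big-endian BOM (FE FF), followed by "A" encoded as 00 41.
    byte[] utf16 = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};

    String fromUtf8 = decode(utf8, StandardCharsets.UTF_8);
    String fromUtf16 = decode(utf16, StandardCharsets.UTF_16);

    // The UTF-8 decoder hands U+FEFF to the application as the first char.
    System.out.println(fromUtf8.length()); // prints 2
    // The UTF-16 decoder uses the BOM to pick the byte order and drops it.
    System.out.println(fromUtf16.length()); // prints 1
  }
}
```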

gissuebot · 2014-10-31T19:28:27Z

Original comment posted by kevinb@google.com on 2011-07-13 at 06:18 PM

(No comment entered for this change.)

Status: Triaged

gissuebot · 2014-10-31T20:15:18Z

Original comment posted by fry@google.com on 2011-12-10 at 03:45 PM

(No comment entered for this change.)

Labels: Package-IO

gissuebot · 2014-10-31T20:49:49Z

Original comment posted by fry@google.com on 2012-02-16 at 07:17 PM

(No comment entered for this change.)

Status: Acknowledged

gissuebot · 2014-11-01T00:03:23Z

Original comment posted by kevinb@google.com on 2012-06-22 at 06:16 PM

(No comment entered for this change.)

Status: Research

gissuebot · 2014-11-01T00:54:33Z

Original comment posted by j...@durchholz.org on 2012-12-27 at 10:01 AM

The BOM is useful if the program needs a way to autodetect a text file's encoding. This is so even on Unixoid systems: the output from the file command says "UTF-8 Unicode (with BOM) text", so if it's a misfeature, it's not just a Windows one. Of course it's just a heuristic, but heuristics do have their place.
Arguing that something belongs in the JDK instead of in Guava ignores the very mission statement of Guava, which is essentially "let's do things right where the JDK dropped the ball". So in fact if the JDK does this wrong, Guava should do something about it.
Some programs want to see the BOM, others want to have the BOM skipped for them if it's present. Programs need a way to express that. Using different character sets would cover that.
I'm not sure what the semantics of an x-UTF-8-BOM charset would be when writing: write a BOM or not? The path of least resistance would be to write the BOM with x-UTF-8-BOM and leave it unwritten with UTF-8, but that would punish the best approach (ignore the BOM on input, don't write it on output) with the most complicated handling (different character sets for reading and writing).

gissuebot · 2014-11-01T01:55:48Z

Original comment posted by NikolayMetchev on 2014-01-07 at 11:57 AM

This was filed as a bug in the JDK. They decided not to fix it there for backward compatibility reasons:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
Google Data API has a solution which could be moved to Guava:
https://developers.google.com/gdata/javadoc/com/google/gdata/util/io/base/UnicodeReader?csw=1

garretwilson · 2015-11-10T02:44:20Z

Any progress on this? Won't Guava help us read a BOM?

jredfox · 2018-04-13T20:10:01Z

Have their input stream remove any char with the value 65279 at index 0. The BOM is not pointless: Notepad uses it to easily determine which UTF encoding the file is in. To be honest, I think this is what file headers are made for. Why not just have a file header with the string utf-x at the front? It only takes a couple of bytes. But I didn't design the UTF protocol.
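
As a rough sketch of that suggestion, applied to already-decoded text rather than to the stream itself: drop the char with value 65279 (U+FEFF) if it sits at index 0. `LeadingBom` and `strip` are hypothetical names for this example, not Guava API.

```java
/** Hypothetical helper for illustration; not part of Guava or the JDK. */
final class LeadingBom {
  /** U+FEFF, i.e. decimal 65279. */
  static final char BOM = (char) 65279;

  /** Returns the text without a leading U+FEFF; other occurrences are left alone. */
  static String strip(String text) {
    if (!text.isEmpty() && text.charAt(0) == BOM) {
      return text.substring(1);
    }
    return text;
  }

  public static void main(String[] args) {
    String decoded = BOM + "key=value";
    System.out.println(strip(decoded)); // prints "key=value"
  }
}
```

Only index 0 is checked, since U+FEFF elsewhere in the text is a legitimate character and should not be removed.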

gissuebot added the type=defect, migrated, status=research, and package=io labels Nov 1, 2014

cgdecker removed the migrated label Nov 1, 2014

pathikrit mentioned this issue Feb 10, 2017

Unicode BOM support ? pathikrit/better-files#107

Closed

netdpb added the P3 label Aug 5, 2019
