Files and Resources do not handle UTF-8 files with BOM #345

Open
gissuebot opened this issue Oct 31, 2014 · 12 comments

@gissuebot

Original issue created by kai@google.com on 2010-04-08 at 07:59 PM


By the UTF-8 definition, UTF-8 files are allowed to have an optional leading
BOM. This BOM is stupid and pointless, but many Windows apps seem to
generate UTF-8 files with the BOM. Guava's classes Files and Resources do
not handle UTF-8 files with a BOM. I'm not sure where this fix belongs, or
whether it should even be fixed at all (since Windows is being stupid, and
people are rightly sick and tired of working around Windows issues). BTW, I
don't personally use Windows. I'm reporting this issue only because I
maintain a library that uses Guava, and there are some Windows users of my
library that are running into this issue.

@gissuebot

Original comment posted by kevinb@google.com on 2010-04-09 at 07:05 PM


Good to know. Based on this, it probably makes sense for us to check for a byte-order
mark and just advance past it. I assume JDK classes like Reader already do this?
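
For illustration only, here is a minimal sketch of what "check for a byte-order mark and just advance past it" could look like at the byte level. The class and method names (`BomSkippingStreams`, `skipUtf8Bom`) are hypothetical and not part of Guava or the JDK; the sketch assumes Java 9+ for `readAllBytes` and only handles the UTF-8 BOM (bytes EF BB BF).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not an existing Guava or JDK API.
final class BomSkippingStreams {

    /** Returns a stream positioned just after a leading UTF-8 BOM, if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // Read up to three bytes, tolerating short reads and early end of stream.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        // Anything that is not a BOM is ordinary content, so hand it back.
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(withBom));
        System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8)); // prints "hi"
    }
}
```

Working on bytes keeps the check independent of whichever charset decoder is applied afterwards; the trade-off is that it silently assumes the caller intends to decode as UTF-8.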

@gissuebot

Original comment posted by fry@google.com on 2011-01-28 at 04:03 PM


(No comment entered for this change.)


Status: Accepted
Labels: Type-Defect

@gissuebot

Original comment posted by mail4danny on 2011-02-04 at 04:18 PM


No, the JDK just quietly ignores this ;-)

@gissuebot

Original comment posted by finnw1 on 2011-02-04 at 09:01 PM


The JDK does detect (and strip) the BOM for some encodings, e.g.
Standard encodings:
UTF-16
UTF-32
Non-standard encodings (that are reported by Charset.availableCharsets()) on my system:
x-UTF-16LE-BOM
X-UTF-32BE-BOM
X-UTF-32LE-BOM
It's for those that are not expected to contain a BOM that the BOM is returned to the application.
It sounds as though what you want is a standard encoding based on UTF-8 that accepts a BOM, e.g.
UTF-8-BOM
But (1) this is a feature request not a defect and
    (2) it belongs in the JDK not Guava
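
A small sketch that makes the difference described above observable, assuming only the standard `java.nio.charset.StandardCharsets` constants. The class name `BomDecodingDemo` and its helpers are made up for this example. It decodes the letter "A" preceded by a BOM, once as UTF-8 and once as UTF-16.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; only the JDK decoders it calls are real APIs.
final class BomDecodingDemo {

    // "A" preceded by the UTF-8 encoding of U+FEFF.
    static final byte[] UTF8_WITH_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'A'};
    // "A" preceded by the big-endian UTF-16 BOM.
    static final byte[] UTF16_WITH_BOM = {(byte) 0xFE, (byte) 0xFF, 0x00, 'A'};

    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String utf8 = decodeUtf8(UTF8_WITH_BOM);
        String utf16 = decodeUtf16(UTF16_WITH_BOM);
        // The UTF-8 decoder hands the BOM to the application as a leading U+FEFF char.
        System.out.println(utf8.length() + " " + (utf8.charAt(0) == 0xFEFF)); // prints "2 true"
        // The UTF-16 decoder consumes the BOM to pick the byte order and drops it.
        System.out.println(utf16.length() + " " + utf16); // prints "1 A"
    }
}
```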

@gissuebot

Original comment posted by kevinb@google.com on 2011-07-13 at 06:18 PM


(No comment entered for this change.)


Status: Triaged

@gissuebot

Original comment posted by fry@google.com on 2011-12-10 at 03:45 PM


(No comment entered for this change.)


Labels: Package-IO

@gissuebot

Original comment posted by fry@google.com on 2012-02-16 at 07:17 PM


(No comment entered for this change.)


Status: Acknowledged

@gissuebot

Original comment posted by kevinb@google.com on 2012-06-22 at 06:16 PM


(No comment entered for this change.)


Status: Research

@gissuebot

gissuebot commented Nov 1, 2014

Original comment posted by j...@durchholz.org on 2012-12-27 at 10:01 AM


  1. The BOM is useful if the program needs a way to autodetect a text file's encoding. This holds even on Unixoid systems: the output from the file command says "UTF-8 Unicode (with BOM) text", so if it's a misfeature, it's not just a Windows one. Of course it's just a heuristic, but heuristics do have their place.
  2. Arguing that something belongs in the JDK instead of in Guava ignores the very mission statement of Guava, which is essentially "let's do things right where the JDK dropped the ball". So in fact, if the JDK does this wrong, Guava should do something about it.
  3. Some programs want to see the BOM; others want the BOM skipped for them if it's present. Programs need a way to express that. Using different character sets would cover that.
  4. I'm not sure what the semantics of an x-UTF-8-BOM charset would be when writing: write a BOM or not? The path of least resistance would be to write the BOM with x-UTF-8-BOM and leave it unwritten with UTF-8, but that would punish the best approach (ignore the BOM on input, don't write it on output) with the most complicated handling (different character sets for reading and writing).

@gissuebot

Original comment posted by NikolayMetchev on 2014-01-07 at 11:57 AM


This was filed as a bug in the JDK. They decided not to fix it there for backward compatibility reasons:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
Google Data API has a solution which could be moved to Guava:
https://developers.google.com/gdata/javadoc/com/google/gdata/util/io/base/UnicodeReader?csw=1

@garretwilson

Any progress on this? Won't Guava help us read a BOM?

@jredfox

jredfox commented Apr 13, 2018

Have the input stream remove any char with the value 65279 (U+FEFF) at index 0. The BOM is not pointless: Notepad uses it to easily determine which UTF encoding a file is in. To be honest, I think this is what file headers are made for. Why not just have a file header with the string utf-x at the front? It only takes a couple of bytes. But I didn't design the UTF protocol.
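
A rough sketch of that suggestion applied to already-decoded text, for callers who have read a file into a `String` and want to drop a leading char 65279. The names `LeadingBomStripper` and `stripLeadingBom` are hypothetical, not an existing Guava method.

```java
// Hypothetical helper class; not an existing Guava or JDK API.
final class LeadingBomStripper {

    // 65279 is U+FEFF, the char a UTF-8 BOM decodes to.
    static final char BOM = 65279;

    /** Returns the text without a leading BOM char; other text is returned unchanged. */
    static String stripLeadingBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        String decoded = BOM + "key=value";
        System.out.println(stripLeadingBom(decoded)); // prints "key=value"
    }
}
```

Only index 0 is checked on purpose: U+FEFF elsewhere in the text is a zero-width no-break space and is left alone.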

@netdpb netdpb added the P3 label Aug 5, 2019
5 participants