Java - Converting an ANSI file with German characters to UTF-8
I have downloaded some plain text files from a German website, but I am not sure what the encoding is. There is no byte order mark in the files. I am using a parser that assumes the files are encoded in UTF-8, so it is not handling the accented characters (those that fall in the byte range > 127) correctly.
I would like to convert the files to UTF-8, but I am not sure whether I need to know the original encoding to do this properly.
The way others have been handling these files is to manually open them in Windows Notepad and re-save them as UTF-8. This process preserves the accented characters, so I would like to automate the conversion, if possible without resorting to Windows Notepad.
How does Windows Notepad know how to convert the file to UTF-8 properly?
How should I convert the file to UTF-8 (in Java 6)?
In Java 7, read the text as "Windows-1252", which is Windows Latin-1:
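    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    Path oldPath = Paths.get("c:/temp/old.txt");
    Path newPath = Paths.get("c:/temp/new.txt");
    byte[] bytes = Files.readAllBytes(oldPath);
    // Decode as Windows-1252 and prepend a BOM (U+FEFF) so that Notepad detects UTF-8.
    String content = "\uFEFF" + new String(bytes, "Windows-1252");
    bytes = content.getBytes("UTF-8");
    // With no options given, Files.write creates the file or truncates an existing one.
    Files.write(newPath, bytes);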
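This takes the bytes and interprets them as Windows Latin-1. Now for the Notepad trick: Notepad recognizes the encoding by a preceding BOM marker character. That character is U+FEFF, a zero-width no-break space, which UTF-8 text does not otherwise need.
Then it takes the string and encodes it as UTF-8.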
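To make the BOM step concrete, here is a minimal sketch that encodes the single character U+FEFF as UTF-8 and prints the resulting bytes. The class name `BomDemo` and the helper `bomAsHex` are made up for this illustration and are not part of the answer above.

```java
import java.nio.charset.Charset;

final class BomDemo {
    /** Encodes the BOM character U+FEFF as UTF-8 and returns the bytes as hex. */
    static String bomAsHex() {
        byte[] bom = "\uFEFF".getBytes(Charset.forName("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : bom) {
            // Mask to 0..255 so that negative bytes print as two hex digits.
            hex.append(String.format("%02X ", b & 0xFF));
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(bomAsHex()); // prints "EF BB BF"
    }
}
```

Those three bytes at the start of the file are what a BOM-aware editor looks for. A parser that is not BOM-aware may treat them as content, in which case leaving out the "\uFEFF" prefix is the safer choice.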
Windows-1252 is ISO-8859-1 (pure Latin-1), except that it has some additional special characters, such as comma-shaped quotes, in the range 0x80 - 0x9F.
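The difference between the two charsets can be checked with a small sketch that decodes the same single byte under each of them. The class `Cp1252VsLatin1` and its `decode` helper are hypothetical names for this example, and it assumes the JVM ships the "windows-1252" charset (common, but only ISO-8859-1 and the Unicode charsets are guaranteed by the platform specification).

```java
import java.nio.charset.Charset;

final class Cp1252VsLatin1 {
    /** Decodes one byte with the named charset and returns the resulting code point. */
    static int decode(int byteValue, String charsetName) {
        String s = new String(new byte[] { (byte) byteValue }, Charset.forName(charsetName));
        return s.charAt(0);
    }

    public static void main(String[] args) {
        // 0xE4 (a with umlaut) is the same character in both charsets.
        System.out.println(Integer.toHexString(decode(0xE4, "windows-1252"))); // e4
        System.out.println(Integer.toHexString(decode(0xE4, "ISO-8859-1")));   // e4
        // 0x84 is the German low opening quote in Windows-1252, but a control character in ISO-8859-1.
        System.out.println(Integer.toHexString(decode(0x84, "windows-1252"))); // 201e
        System.out.println(Integer.toHexString(decode(0x84, "ISO-8859-1")));   // 84
    }
}
```

So the German umlauts survive under either charset; decoding as ISO-8859-1 only goes wrong for typographic quotes, the euro sign, and similar characters.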
In Java 6:
    // Needs java.io.DataInputStream, File, FileInputStream, FileOutputStream, OutputStream.
    File oldPath = new File("c:/temp/old.txt");
    File newPath = new File("c:/temp/new.txt");
    long longLength = oldPath.length();
    if (longLength > Integer.MAX_VALUE) {
        throw new IllegalArgumentException("File too large: " + oldPath.getPath());
    }
    int fileSize = (int) longLength;
    byte[] bytes = new byte[fileSize];
    // readFully fills the whole buffer; a single read() call may return fewer bytes.
    DataInputStream in = new DataInputStream(new FileInputStream(oldPath));
    try {
        in.readFully(bytes);
    } finally {
        in.close();
    }
    // Decode as Windows-1252 and prepend a BOM (U+FEFF) so that Notepad detects UTF-8.
    String content = "\uFEFF" + new String(bytes, "Windows-1252");
    bytes = content.getBytes("UTF-8");
    OutputStream out = new FileOutputStream(newPath);
    try {
        out.write(bytes);
    } finally {
        out.close();
    }
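If the files might be too large to hold in memory, the same conversion can be streamed. This is only a sketch of one way to do it, using plain java.io classes that exist in Java 6; the class `StreamingConverter` and its `convert` method are names invented here, and it again assumes the source really is Windows-1252.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

final class StreamingConverter {
    /** Copies source to target, decoding Windows-1252 and encoding UTF-8 with a leading BOM. */
    static void convert(File source, File target) throws IOException {
        Reader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(source), "windows-1252"));
        try {
            Writer out = new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream(target), "UTF-8"));
            try {
                out.write('\uFEFF'); // BOM, so that Notepad detects UTF-8
                char[] buffer = new char[8192];
                int n;
                // Copy decoded characters in chunks until end of input.
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            } finally {
                out.close();
            }
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java StreamingConverter <source file> <target file>
        convert(new File(args[0]), new File(args[1]));
    }
}
```

The reader does the Windows-1252 decoding and the writer does the UTF-8 encoding, so no size check or full-file buffer is needed.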