
Java TXT to PDF garbled text: Chinese characters garbled when converting TXT to PDF with OpenOffice

Date: 2019-11-23 12:12:27

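Problem:

When converting a txt file to PDF with OpenOffice, the Chinese text comes out garbled.

Approach and solution:

1. Find out why the text is garbled

Reading the jodconverter source shows that only UTF-8 encoded text converts without garbled Chinese.

2. Convert non-UTF-8 files to UTF-8

Before converting, we first have to determine the encoding of the txt file itself. It turns out that a txt file can start with a header that identifies its encoding.

The detection method is as follows: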

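import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

/**
 * Returns the encoding of a text file, guessed from its first two bytes.
 *
 * @param filePath path of the file to inspect
 * @return the charset name to use when reading the file
 * @throws IOException if the file cannot be read
 */
public static String getCharset(String filePath) throws IOException {
    // try-with-resources closes the stream even if reading fails
    try (BufferedInputStream bin = new BufferedInputStream(new FileInputStream(filePath))) {
        // Combine the first two bytes into one value to compare against BOM prefixes
        int p = (bin.read() << 8) + bin.read();
        String code;
        switch (p) {
        case 0xefbb: // EF BB: first two bytes of the UTF-8 BOM (EF BB BF)
            code = "UTF-8";
            break;
        case 0xfffe: // FF FE: UTF-16 little-endian BOM
            code = "Unicode";
            break;
        case 0xfeff: // FE FF: UTF-16 big-endian BOM
            code = "UTF-16";
            break;
        default: // no BOM found: assume the legacy Chinese encoding
            code = "GB2312";
        }
        System.out.println(code);
        return code;
    }
}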

The conversion code is as follows:

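import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.UnsupportedCharsetException;

/**
 * Writes a text file in the given encoding, overwriting any existing file.
 *
 * @param file          the file to write
 * @param toCharsetName the target encoding
 * @param content       the file content
 * @throws Exception if the charset is unsupported or the write fails
 */
public static void saveFile2Charset(File file, String toCharsetName,
        String content) throws Exception {
    if (!Charset.isSupported(toCharsetName)) {
        throw new UnsupportedCharsetException(toCharsetName);
    }
    // try-with-resources flushes and closes the writer even if write() throws
    try (OutputStream outputStream = new FileOutputStream(file);
         OutputStreamWriter outWrite = new OutputStreamWriter(outputStream, toCharsetName)) {
        outWrite.write(content);
    }
}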

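Testing showed that a file converted this way is still detected as GBK, because no header is written. The BOM has to be written into the file header by hand.

The code is as follows: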

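import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.UnsupportedCharsetException;

/**
 * Writes a text file in the given encoding, overwriting any existing file,
 * and prefixes it with the UTF-8 BOM.
 *
 * @param file          the file to write
 * @param toCharsetName the target encoding (the BOM below is only valid for UTF-8)
 * @param content       the file content
 * @throws Exception if the charset is unsupported or the write fails
 */
public static void saveFile2Charset(File file, String toCharsetName,
        String content) throws Exception {
    if (!Charset.isSupported(toCharsetName)) {
        throw new UnsupportedCharsetException(toCharsetName);
    }
    try (OutputStream outputStream = new FileOutputStream(file);
         OutputStreamWriter outWrite = new OutputStreamWriter(outputStream, toCharsetName)) {
        // Write the header marker (UTF-8 BOM) straight to the stream, before any text
        outputStream.write(new byte[]{(byte) 0xEF, (byte) 0xBB, (byte) 0xBF});
        outWrite.write(content);
    }
}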
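The two steps above (decode with the detected encoding, then write UTF-8 with a BOM) can be combined into one helper. The following is only a minimal sketch: the class name `TxtToUtf8` and the method `convert` are hypothetical and not part of the original code, and it assumes the file is small enough to read into memory.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not from the original post.
class TxtToUtf8 {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Re-encodes source (read as sourceCharset) into target as UTF-8 with a BOM. */
    static void convert(Path source, Charset sourceCharset, Path target) throws IOException {
        // Decode the whole file using the encoding detected beforehand
        String text = new String(Files.readAllBytes(source), sourceCharset);
        // Some decoders keep the BOM as a leading U+FEFF character; drop it
        // so the output does not end up with two BOMs
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1);
        }
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        // Output = UTF-8 BOM followed by the re-encoded text
        byte[] out = new byte[UTF8_BOM.length + body.length];
        System.arraycopy(UTF8_BOM, 0, out, 0, UTF8_BOM.length);
        System.arraycopy(body, 0, out, UTF8_BOM.length, body.length);
        Files.write(target, out);
    }
}
```

With the getCharset method shown earlier, a call could look like `TxtToUtf8.convert(Paths.get("in.txt"), Charset.forName(getCharset("in.txt")), Paths.get("out.txt"))`, where both file names are placeholders.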

In testing, source files in GB2312, Unicode, UTF-16 and UTF-8 all converted successfully.

Notes on txt encodings and file headers

Java encoding names and the corresponding txt encodings:

java       txt
unicode    unicode big endian
utf-8      utf-8
utf-16     unicode
gb2312     ANSI

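What is a BOM

A BOM (byte-order mark) is a special marker inserted at the start of a Unicode file encoded as UTF-8, UTF-16 or UTF-32, used to identify the file's encoding. For UTF-8 the BOM is not required: its main job is to mark the encoding and the byte order (big-endian or little-endian) of multi-byte code units, and UTF-8 has only one byte order.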
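To see why byte order matters for UTF-16 but not for UTF-8, it helps to print the bytes of a single character in each encoding. This is a small illustrative sketch; the class name `ByteOrderDemo` and the helper `hex` are made up for the example.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: prints the bytes of U+4E2D in three encodings.
class ByteOrderDemo {
    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u4E2D"; // the character U+4E2D
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 4E 2D
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 2D 4E
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));    // E4 B8 AD
    }
}
```

The two UTF-16 variants hold the same two bytes in opposite order, so a reader needs the BOM (or outside knowledge) to pick the right one. The UTF-8 byte sequence is fixed.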

BOM file headers:

00 00 FE FF = UTF-32, big-endian
FF FE 00 00 = UTF-32, little-endian
EF BB BF    = UTF-8
FE FF       = UTF-16, big-endian
FF FE       = UTF-16, little-endian
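The list above can be turned into a detector that checks the full BOM rather than only the first two bytes. This is a sketch under assumptions: the class `BomSniffer` and its methods are hypothetical names, and a file without a BOM returns null so that the caller can choose a fallback such as GB2312. Note that FF FE 00 00 is treated as UTF-32 little-endian here, although the same bytes could in principle be a UTF-16 little-endian file that begins with a NUL character.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not from the original post.
class BomSniffer {
    /** Returns the encoding named by the BOM at the start of head, or null if there is none. */
    static String detect(byte[] head) {
        // The 4-byte BOMs go first: FF FE 00 00 also starts with the UTF-16 LE BOM FF FE
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    /** Reads up to four bytes from the file and checks them for a BOM. */
    static String detectFile(Path file) throws IOException {
        byte[] head = new byte[4];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int r;
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (n < head.length && (r = in.read(head, n, head.length - n)) != -1) {
                n += r;
            }
        }
        return detect(Arrays.copyOf(head, n));
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```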

Note: jodconverter 2.2.1 does not support converting docx, xlsx or ppt files to PDF.
