A complete walkthrough of drawing word clouds in Python: the jieba library, NLP word segmentation, and word-frequency output

Date: 2022-11-04 18:04:10

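Key libraries:

re: text processing; replaces characters inside strings
jieba: Chinese text segmentation library; splits sentences into words
WordCloud: generates the word-cloud image
PIL: image processing; a powerful, long-established library
matplotlib: plotting library; needs no introduction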

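Code logic

1. Read the NLP stop-word file. For fast lookups, the words are stored in a dictionary.
2. Read the text to be analysed.
3. Segment the text with jieba. This step also removes spaces, strips punctuation, and registers domain-specific terms.
4. Aggregate the segmentation results (a set removes the duplicates).
5. Sort the results by frequency (a dict holds the count for each word).
6. Write out the word-frequency results as text.
7. Read the image that gives the word cloud its shape. Although the file is a PNG, it must be a non-transparent image; a standard PNG with a transparent background cannot be used. WordCloud treats pure-white pixels as the masked-out area, so the region outside the shape should be white.
8. Draw the word cloud from the frequency results.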
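Steps 4 and 5 above (deduplicate, count, sort) can also be done in one pass with collections.Counter from the standard library. The following is only a minimal sketch, not the method used in this post: count_words is a hypothetical helper, and the token list is made-up data standing in for the output of jieba.cut().

```python
from collections import Counter


def count_words(tokens, stopwords):
    """Count tokens that are non-empty after stripping and are not stop words."""
    stripped = (t.strip() for t in tokens)
    return Counter(t for t in stripped if t and t not in stopwords)


# Hypothetical pre-segmented tokens (in the real script these come from jieba.cut)
tokens = ["区块链", "的", "技术", "区块链", " ", "应用", "的", "技术", "区块链"]
stopwords = {"的"}

freq = count_words(tokens, stopwords)
ranked = freq.most_common()  # list of (word, count), highest count first
print(ranked)  # [('区块链', 3), ('技术', 2), ('应用', 1)]
```

Counter is a dict subclass, so an object like freq should be usable wherever a plain word-to-count dictionary is expected. It also avoids calling list.count() once per unique word, which gets slow on long texts.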

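* Other notes:

Text encoding: to keep processing simple, everything is handled as UTF-8.
Stop-word list: you can find one on GitHub or build your own. Recommended: the stop-word list provided at /woshishui68892/article/details/108203121, in its merged "stopword" version.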
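If you build your own list from several sources, merging them is straightforward. The sketch below is one possible way to do it, under assumptions: merge_stopwords and save_stopwords are hypothetical helper names, and each input file is assumed to hold one stop word per line. Reading with the utf-8-sig codec accepts UTF-8 files with or without a byte-order mark, so a BOM left by some Windows editors does not end up glued to the first word.

```python
from pathlib import Path


def merge_stopwords(paths, encoding="utf-8-sig"):
    """Read several one-word-per-line files and return the union as a set."""
    merged = set()
    for path in paths:
        with open(path, encoding=encoding) as f:
            for line in f:
                word = line.strip()
                if word:  # skip blank lines
                    merged.add(word)
    return merged


def save_stopwords(words, out_path):
    """Write the merged list as plain UTF-8, one word per line, sorted."""
    Path(out_path).write_text("\n".join(sorted(words)) + "\n", encoding="utf-8")
```

A call might look like save_stopwords(merge_stopwords(["a.txt", "b.txt"]), "stopword.txt"), where the file names are placeholders.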

The code:

import re
import jieba
from wordcloud import WordCloud
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

# Load the stop words into a dictionary for fast membership tests
stopwords = {}
with open(r"\\vmware-host\Shared Folders\mod\ok\stopword.txt", 'r', encoding='utf-8') as fstop:
    for eachWord in fstop:
        stopwords[eachWord.strip()] = eachWord.strip()

# jieba.enable_parallel(4)  # parallel segmentation works on Linux, not on Windows
outStr = ''
jieba.add_word('区块链')  # register a domain term so jieba keeps it as one token

with open(r'\\vmware-host\Shared Folders\mod\ok\目标文本.txt', 'r', encoding='utf-8') as fin:  # open for reading
    for eachLine in fin:
        line = eachLine.strip()  # strip whitespace at both ends of the line
        # Remove digits, whitespace and punctuation. The hyphen is escaped:
        # unescaped, ":-【" is a range that also strips ASCII letters.
        line1 = re.sub(r"[0-9\s+\.\!\/_,$%^*()?;；:\-【】+\"\']+|[+——！，;:。？、~@#￥%……&*（）]+", "", line)
        wordList = list(jieba.cut(line1))  # segment each line with jieba
        for word in wordList:
            if word not in stopwords:
                outStr += word
                outStr += ' '

words = outStr.split()  # split the accumulated string into words; split() drops empty items
words_index = set(words)  # deduplicate using a set
dic = {index: words.count(index) for index in words_index}  # count word frequencies
sorted_x = sorted(dic.items(), key=lambda d: d[1], reverse=True)  # sort by frequency, descending; result is a list
print(sorted_x)  # print the result to the console for inspection

with open(r"C:\Users\Administrator\Desktop\NLP\out.txt", "w", encoding='utf-8') as f_out:  # open for writing
    f_out.write('Word frequency statistics: \n')
    for single in sorted_x:
        f_out.write(str(single[0]) + ":" + str(single[1]) + ' \n')

graph = np.array(Image.open(r'\\vmware-host\Shared Folders\mod\ok\点赞.png'))  # read the outline image as a pixel array
# Background colour, shape mask and a font with Chinese glyphs;
# max_words=80 keeps only the 80 most frequent words
wc = WordCloud(background_color='white', mask=graph,
               font_path=r'C:/Windows/Fonts/simkai.ttf',
               max_words=80, max_font_size=150)
wc.generate_from_frequencies(dic)  # feed in the word-frequency data
wc.to_file(r"C:\Users\Administrator\Desktop\NLP\zhongwenciyun.jpg")  # save the image

# Show the image
plt.imshow(wc)
plt.axis("off")  # hide the axes
plt.show()
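The cleaning step relies on a regular-expression character class, and one detail is easy to miss: inside [...], an unescaped hyphen between two characters defines a range. A fragment such as :-【 therefore matches every code point from U+003A to U+3010, ASCII letters included. The sketch below isolates the cleaning step and sidesteps the problem by building the class with re.escape. The names NOISE_CHARS and clean_line are hypothetical, and the character list is an assumption modelled on the punctuation used in this post; adjust it to your text.

```python
import re

# Punctuation to strip (an assumed list; extend as needed)
NOISE_CHARS = ".!/_,$%^*()?;；:-【】+\"'——！，。？、~@#￥…&（）"

# re.escape makes every character literal, so "-" cannot form a range
NOISE_RE = re.compile("[0-9\\s" + re.escape(NOISE_CHARS) + "]+")


def clean_line(line):
    """Strip digits, whitespace and the punctuation listed above."""
    return NOISE_RE.sub("", line.strip())


print(clean_line("  区块链-2022，Python！ "))  # 区块链Python

# For comparison: the unescaped range also removes ASCII letters
print(re.sub("[:-【]+", "", "abc区块链"))  # 区块链
```

Whether English words should survive cleaning depends on the corpus. If they should be removed, it is clearer to add a-zA-Z to the class explicitly than to rely on an accidental range.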

Enjoy it!
