
Extracting Keywords from Each File in a File List with Python

Date: 2023-02-12 16:33:35

Description:

Collect all files under a given path, extract the 300 most frequent words from each file, and save them to a database.
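The frequency step described above can be sketched with the standard library alone. This is only an illustration: the function name `top_keywords` and the regex tokenization are assumptions made here, standing in for the NLTK tokenizer the script itself relies on. The filter (alphabetic words longer than two characters) follows the script.

```python
import re
from collections import Counter


def top_keywords(text, limit=300, min_length=3):
    """Return up to `limit` words from `text`, most frequent first.

    Follows the script's filter: alphabetic words longer than two characters.
    """
    # Assumption: a simple regex stands in for NLTK's tokenizer.
    words = re.findall(r"[^\W\d_]+", text)
    counts = Counter(w for w in words if len(w) >= min_length)
    return [word for word, _ in counts.most_common(limit)]


sample = "privacy policy data privacy data privacy of us"
print(top_keywords(sample, limit=2))  # ['privacy', 'data']
```

Short words such as "of" and "us" are dropped before counting, so they never compete for one of the 300 slots.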

Prerequisite: NLTK must be installed and configured.
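Before running the script, it may help to confirm that the `nltk` package is importable. The helper name `nltk_available` below is hypothetical, and the check only looks for the package; it does not verify any other part of the setup.

```python
import importlib.util


def nltk_available():
    """Return True if the nltk package can be found in this environment."""
    return importlib.util.find_spec("nltk") is not None


if not nltk_available():
    # Assumption: pip is the package manager in use.
    print("nltk is not installed; try: pip install nltk")
```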

#!/usr/bin/python
# coding=utf-8
'''
function : This script will create a database named mydb, then
           extract the keywords of privacy policy files.
author   : Chicho
date     : /7/28
running  : python key_extract.py -d path_of_file
'''

import sys, getopt
import nltk
import MySQLdb
from nltk.corpus import PlaintextCorpusReader

corpus_root = ""

if __name__ == '__main__':
    # long options must be passed as a list; "directory=" takes a value
    opts, args = getopt.getopt(sys.argv[1:], "d:h", ["directory=", "help"])
    # get the directory
    for op, value in opts:
        if op in ("-d", "--directory"):
            corpus_root = value

    # Parsing options this way is a little complicated. A simpler alternative
    # is to pass the path as the only argument and read it from sys.argv:
    #   running : python key_extract.py path_of_file
    #   corpus_root = sys.argv[1]

    # corpus_root is the directory of privacy policy files; all of them are HTML files
    filelists = PlaintextCorpusReader(corpus_root, '.*')
    # get the list of files
    files = filelists.fileids()

    # connect to the database; replace the placeholders with your own settings
    # (your_port must be an integer)
    conn = MySQLdb.connect(host='your_personal_host_ip_address', user='rusername',
                           port=your_port, passwd='U_password')
    # get the cursor
    curs = conn.cursor()
    conn.set_character_set('utf8')
    curs.execute('set names utf8')
    curs.execute('SET CHARACTER SET utf8;')
    curs.execute('SET character_set_connection=utf8;')

    # (disabled) conn.text_factory = lambda x: unicode(x, 'utf8', "ignore")
    # (disabled) conn.text_factory = str

    # (disabled) create a database named mydb
    # try:
    #     curs.execute("create database mydb")
    # except Exception as e:
    #     print(e)

    conn.select_db('mydb')

    # add the columns key0 .. key299 to the existing table filekeywords
    try:
        for i in range(300):
            sql = "alter table filekeywords add key" + str(i) + " varchar(45)"
            curs.execute(sql)
    except Exception as e:
        print(e)

    i = 0
    for privacyfile in files:
        # f = open(privacyfile, 'r', encoding='utf-8')
        # pass values as parameters so quotes in file names cannot break the SQL
        curs.execute("insert into filekeywords set id = %s", (i,))
        curs.execute("update filekeywords set name = %s where id = %s", (privacyfile, i))
        # get the words in the privacy policy
        wordlist = [w for w in filelists.words(privacyfile) if w.isalpha() and len(w) > 2]
        # get the keywords, most frequent first
        fdist = nltk.FreqDist(wordlist)
        # In NLTK 3, keys() is not sorted by frequency, so use most_common()
        vol = [w for w, _ in fdist.most_common(300)]
        for j, word in enumerate(vol):
            sql = "update filekeywords set key" + str(j) + " = %s where id = %s"
            curs.execute(sql, (word, i))
        i = i + 1

    conn.commit()
    curs.close()
    conn.close()
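To try the storage layout used by the script without a MySQL server, the sketch below reproduces it with the standard-library `sqlite3` module and an in-memory database. This swaps SQLite in for MySQL purely for illustration; the helper `store_keywords`, the file name `policy.html`, and the reduced column count are all assumptions, not part of the original script.

```python
import sqlite3

KEY_COLUMNS = 5  # the script uses 300 columns; kept small here for readability


def store_keywords(conn, file_id, name, keywords):
    """Insert one row for a file and fill its key0..keyN columns in order."""
    conn.execute("INSERT INTO filekeywords (id, name) VALUES (?, ?)", (file_id, name))
    for j, word in enumerate(keywords[:KEY_COLUMNS]):
        # Column names cannot be bound as parameters; j is an integer we control.
        conn.execute(
            "UPDATE filekeywords SET key%d = ? WHERE id = ?" % j, (word, file_id)
        )


conn = sqlite3.connect(":memory:")
columns = ", ".join("key%d VARCHAR(45)" % j for j in range(KEY_COLUMNS))
conn.execute("CREATE TABLE filekeywords (id INTEGER PRIMARY KEY, name TEXT, %s)" % columns)

# Hypothetical file name and keyword list, already ordered by frequency.
store_keywords(conn, 0, "policy.html", ["privacy", "data", "policy"])
conn.commit()

row = conn.execute(
    "SELECT name, key0, key1, key2, key3 FROM filekeywords WHERE id = 0"
).fetchone()
print(row)  # ('policy.html', 'privacy', 'data', 'policy', None)
```

Files with fewer distinct words than there are columns simply leave the remaining columns as NULL, which is also what happens in the wide table the script fills.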

Please credit the source when reposting: /chichoxian/article/details/4603
