100字范文 > 【自然语言处理】词袋模型在文本分类中的用法

【自然语言处理】词袋模型在文本分类中的用法

时间：2019-01-29 00:29:46

词袋模型在文本分类中的用法

1.加载数据

20 Newsgroups：数据被组织成 20 个不同的新闻组，每个新闻组对应一个不同的主题。一些新闻组彼此非常密切相关（例如comp.sys.ibm.pc.hardware/comp.sys.mac.hardware），而其他新闻组则非常不相关（例如misc.forsale/soc.religion.christian）。以下是 20 个新闻组的列表，按主题划分（或多或少）：

数据集：下载地址

# 加载训练集、测试集from sklearn import datasetstwenty_train = datasets.load_files("20news-bydate/20news-bydate-train")twenty_test = datasets.load_files("20news-bydate/20news-bydate-test")

sklearn.datasets.load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error=‘strict’, random_state=0)

container_path：container_folder 的路径；load_content = True：是否把文件中的内容加载到内存；encoding = None：编码方式。当前文本文件的编码方式一般为 utf-8，如果不指明编码方式（encoding=None），那么文件内容将会按照 bytes 处理，而不是 unicode 处理。

返回值：Bunch，Dictionary-like object。主要属性包括：

data：原始数据；filenames：每个文件的名字；target：类别标签（从 0 0 0 开始的整数索引）；target_names：类别标签的具体含义（由子文件夹的名字 category_1_folder 等决定）。

2.文本表示

将文本文件变成数字的特征表示，这里主要是利用词袋模型。

2.1 BOW

使用 CountVectorizer 构建词频向量。CountVectorizer 支持单词或者连续字符的 N-gram 模型的计数，利用 scipy.sparse 矩阵只在内存中保存特征向量中非 0 0 0 元素位置以节省内存。

from sklearn.feature_extraction.text import CountVectorizercount_vect = CountVectorizer(stop_words="english", decode_error='ignore') # 创建词频转换器X_train_counts = count_vect.fit_transform(twenty_train.data) # 转换训练集print(X_train_counts.shape)

(11314, 129782)

2.2 TF-IDF

用 TfidfTransformer 将词频向量转为 TF-IDF 形式。

from sklearn.feature_extraction.text import TfidfTransformertfidf_transformer = TfidfTransformer()X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)print(X_train_tfidf.shape)

(11314, 129782)

3.文本分类

3.1 正常流程

离散型朴素贝叶斯 MultinomialNB，朴素贝叶斯模型在多项分布假设下的实现，适用于多分类场景。

from sklearn.naive_bayes import MultinomialNBclf = MultinomialNB()clf.fit(X_train_tfidf, twenty_train.target)docs_new = ['God is love', 'OpenGL on the GPU is fast']X_new_counts = count_vect.transform(docs_new) # 计算词频X_new_tfidf = tfidf_transformer.transform(X_new_counts) # 计算 TF-IDFy_pred = clf.predict(X_new_tfidf)for doc, category in zip(docs_new, y_pred): # category是数字print(("%r => %s")%(doc, twenty_train.target_names[category]))

3.2 管道流水

使用管道后，测试集不用一步步重复训练集的预处理，直接管道处理了。

from sklearn.pipeline import Pipelinetext_clf = Pipeline([('vect', CountVectorizer(stop_words="english", decode_error='ignore')),('tfidf', TfidfTransformer()),('clf', MultinomialNB())])text_clf = text_clf.fit(twenty_train.data, twenty_train.target) # 训练集score = text_clf.score(twenty_test.data, twenty_test.target) # 测试集print(score)

0.8169144981412639

SVM，对于大型数据集，使用 LinearSVC 或 SGDClassifier。

from sklearn.linear_model import SGDClassifierfrom sklearn import metricstext_clf_2 = Pipeline([('vect', CountVectorizer(stop_words='english', decode_error='ignore')), # 去停用词('tfidf', TfidfTransformer()),('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=100, random_state=42))])text_clf_2.fit(twenty_train.data, twenty_train.target)# text_clf_2.score(twenty_test.data, twenty_test.target)y_pred = text_clf_2.predict(twenty_test.data)acc = metrics.accuracy_score(twenty_test.target, y_pred)print(acc)

0.8227562400424854

4.结果报告

各类别的精确度，召回率，F值等。

print(metrics.classification_report(twenty_test.target, y_pred, target_names=twenty_test.target_names))

5.超参调优

5.1 网格搜索

CountVectorizer()中的n-gramTfidfTransformer()中的use_idfSGClassifier()中的惩罚系数alpha

from sklearn.model_selection import GridSearchCVparameters = {'vect__ngram_range': [(1, 1), (1, 2)],'tfidf__use_idf': (True, False),'clf__alpha': (1e-2, 1e-3)}gs_clf = GridSearchCV(text_clf_2, parameters, n_jobs=-1, cv=5) # text_clf_2: SVC 的 Pipelinegrid_result = gs_clf.fit(twenty_train.data, twenty_train.target)print("Best: %f \nusing %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.906842 using {'clf__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}

5.2 随机搜索

随机搜索进行超参数优化。

from sklearn.model_selection import RandomizedSearchCVparameters = {'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],'tfidf__use_idf': (True, False),'clf__alpha': (1e-1, 1e-2, 1e-3, 1e-4, 1e-5)}rs_clf = RandomizedSearchCV(text_clf_2, parameters, n_jobs=-1, cv=5)rs_result = rs_clf.fit(twenty_train.data, twenty_train.target)print("Best: %f \nusing %s" % (rs_result.best_score_, rs_result.best_params_))

Best: 0.927436 using {'vect__ngram_range': (1, 2), 'tfidf__use_idf': True, 'clf__alpha': 1e-05}

已确定 TfidfTransformer 里的参数use_idf=True会更好。

5.3 网格搜索-增加超参数

增加参数：

CountVectorizer 里的max_df：按比例或绝对数量删除 d f超过 max_df 的词。CountVectorizer里的max_features：选择 tf 最大的 max_features 个特征。TfidfTransformer里的norm：数据标准化，{‘l1’, ‘l2’} or None, default=’l2’。SGDClassifier里的penalty：惩罚项。SGDClassifier里的n_iter：迭代次数。

重要的参数：

vect__max_dfvect__ngram_rangeclf__alphaclf__penalty

from sklearn.model_selection import GridSearchCVfrom sklearn.linear_model import SGDClassifierfrom sklearn.pipeline import Pipelinetext_clf_3 = Pipeline([('vect', CountVectorizer(stop_words="english", decode_error='ignore')),('tfidf', TfidfTransformer()),('clf', SGDClassifier()), ]) parameters = {'vect__max_df': (0.5, 0.75, 1.0),'vect__max_features': (None, 10000, 50000,60000),'vect__ngram_range': [(1, 1), (1, 2),(2,2)], 'tfidf__norm': ('l1', 'l2'), 'clf__alpha': (0.0001, 0.00001, 0.000001,0.0000001),'clf__penalty': ('l2', 'elasticnet'),'clf__max_iter': (10, 50, 80),}gs_clf = GridSearchCV(text_clf_3, parameters, n_jobs = -1, cv=5)grid_result = gs_clf.fit(twenty_train.data, twenty_train.target)print("Best: %f \nusing %s" % (grid_result.best_score_, grid_result.best_params_))

计算的时间有点长，最好自己验证一下。

Best: 0.931766 using {'vect__ngram_range': (1, 2), 'clf__alpha': 1e-07, 'vect__max_df': 0.75, 'vect__max_features': None, 'clf__n_iter': 50, 'clf__penalty': 'elasticnet', 'tfidf__norm': 'l1'}

总结：当文本被向量化后，就可以将其看作是数字，按照常见的机器学习方法进行回归、分类、聚类、降维等任务。

本内容不代表本网观点和政治立场，如有侵犯你的权益请联系我们处理。

网友评论

网友评论仅供其表达个人看法，并不表明网站立场。