
Movie Review Sentiment Analysis: A Live-Stream Case Study

Date: 2018-09-15 12:30:16


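Sentiment analysis is a challenging task in machine learning. The dataset contains 50,000 IMDB movie reviews. The 25,000 reviews in the training set carry a binary sentiment label: reviews with an IMDB rating < 5 are labeled 0, and reviews with a rating >= 7 are labeled 1. The other 25,000 reviews form the test set and have no labels.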
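The labeling rule can be written down as a small function. This is only a sketch: the helper name `rating_to_label` is hypothetical, and it assumes ratings are integers on IMDB's 1 to 10 scale. Ratings of 5 and 6 receive no label under the rule stated above.

```python
from typing import Optional

def rating_to_label(rating: int) -> Optional[int]:
    """Map an IMDB rating (1-10) to a binary sentiment label."""
    if rating < 5:
        return 0   # negative review
    if rating >= 7:
        return 1   # positive review
    return None    # ratings 5 and 6 fall outside the labeling rule

labels = [rating_to_label(r) for r in (1, 4, 5, 7, 10)]
print(labels)  # [0, 0, None, 1, 1]
```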

import os
print(os.listdir("./input"))

['testData.csv', 'labeledTrainData.csv']

1. Loading the data

import pandas as pd

# Load the data (the files are tab-delimited despite the .csv extension)
train = pd.read_csv('./input/labeledTrainData.csv', delimiter='\t')
test = pd.read_csv('./input/testData.csv', delimiter='\t')
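To see what the tab delimiter does, the same parsing can be sketched with only the standard library on a tiny in-memory sample. The two rows below are made up, and the column order (id, sentiment, review) is an assumption about the file layout.

```python
import csv
import io

# Hypothetical two-row sample in a tab-delimited layout
sample = (
    "id\tsentiment\treview\n"
    "a_1\t0\tThis movie is terrible, but it has some good effects.\n"
    "b_9\t1\tA masterpiece of tact, subtlety and beauty.\n"
)

# delimiter="\t" splits fields on tabs, so commas inside a review are harmless
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

for row in rows:
    print(row["id"], row["sentiment"], len(row["review"]))
```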

train.shape, test.shape
train.head()

train['review'][0]

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! 
Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."

test.head()

test['review'][0]

"Naturally in a film who's main themes are of mortality, nostalgia, and loss of innocence it is perhaps not surprising that it is rated more highly by older viewers than younger ones. However there is a craftsmanship and completeness to the film which anyone can enjoy. The pace is steady and constant, the characters full and engaging, the relationships and interactions natural showing that you do not need floods of tears to show emotion, screams to show fear, shouting to show dispute or violence to show anger. Naturally Joyce's short story lends the film a ready made structure as perfect as a polished diamond, but the small changes Huston makes such as the inclusion of the poem fit in neatly. It is truly a masterpiece of tact, subtlety and overwhelming beauty."

2. Exploratory data analysis

Check how the sentiment labels are distributed.

print("number of rows for sentiment 1: {}".format(len(train[train.sentiment == 1])))
print("number of rows for sentiment 0: {}".format(len(train[train.sentiment == 0])))

number of rows for sentiment 1: 12500
number of rows for sentiment 0: 12500

train.groupby('sentiment').describe().transpose()

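# Create a new column holding the length of each review in characters
train['length'] = train['review'].apply(len)
train.head()

3. Visualization

3.1 Preparation

# Import the plotting library
import matplotlib.pyplot as plt
%matplotlib inline

# Histogram of the review lengths
train['length'].plot.hist(bins = 100)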


train.length.describe()

count    25000.000000
mean      1327.710560
std       1005.239246
min         52.000000
25%        703.000000
50%        981.000000
75%       1617.000000
max      13708.000000
Name: length, dtype: float64
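The same kind of summary can be reproduced on a handful of strings with the standard library. This is a sketch on made-up sample reviews; `statistics.quantiles` with `method="inclusive"` interpolates linearly between data points, which should correspond to the default quartiles that describe() reports.

```python
import statistics

# Hypothetical sample reviews
reviews = ["bad", "quite good", "a truly wonderful film", "meh, skip it", "ok"]
lengths = [len(r) for r in reviews]  # character counts, like train['review'].apply(len)

q1, median, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
summary = {
    "count": len(lengths),
    "mean": statistics.mean(lengths),
    "std": statistics.stdev(lengths),  # sample standard deviation (n - 1 denominator)
    "min": min(lengths),
    "25%": q1,
    "50%": median,
    "75%": q3,
    "max": max(lengths),
}
for name, value in summary.items():
    print(name, value)
```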

train[train['length'] == 13708]['review'].iloc[0]

'Match 1: Tag Team Table Match Bubba Ray and Spike Dudley vs Eddie Guerrero and Chris Benoit Bubba Ray and Spike Dudley started things off with a Tag Team Table Match against Eddie Guerrero and Chris Benoit. According to the rules of the match, both opponents have to go through tables in order to get the win. Benoit and Guerrero heated up early on by taking turns hammering first Spike and then Bubba Ray. A German suplex by Benoit to Bubba took the wind out of the Dudley brother. Spike tried to help his brother, but the referee restrained him while Benoit and Guerrero ganged up on him in the corner. With Benoit stomping away on Bubba, Guerrero set up a table outside. Spike dashed into the ring and somersaulted over the top rope onto Guerrero on the outside! After recovering and taking care of Spike, Guerrero slipped a table into the ring and helped the Wolverine set it up. The tandem then set up for a double superplex from the middle rope which would have put Bubba through the table, but Spike knocked the table over right before his brother came crashing down! Guerrero and Benoit propped another table in the corner and tried to Irish Whip Spike through it, but Bubba dashed in and blocked his brother. Bubba caught fire and lifted both opponents into back body drops! Bubba slammed Guerrero and Spike stomped on the Wolverine from off the top rope. Bubba held Benoit at bay for Spike to soar into the Wassup! headbutt! Shortly after, Benoit latched Spike in the Crossface, but the match continued even after Spike tapped out. Bubba came to his brother\'s rescue and managed to sprawl Benoit on a table. Bubba leapt from the middle rope, but Benoit moved and sent Bubba crashing through the wood! But because his opponents didn\'t force him through the table, Bubba was allowed to stay in the match. The first man was eliminated shortly after, though, as Spike put Eddie through a table with a Dudley Dawg from the ring apron to the outside! 
Benoit put Spike through a table moments later to even the score. Within seconds, Bubba nailed a Bubba Bomb that put Benoit through a table and gave the Dudleys the win! Winner: Bubba Ray and Spike Dudley<br /><br />Match 2: Cruiserweight Championship Jamie Noble vs Billy Kidman Billy Kidman challenged Jamie Noble, who brought Nidia with him to the ring, for the Cruiserweight Championship. Noble and Kidman locked up and tumbled over the ring, but raced back inside and grappled some more. When Kidman thwarted all Noble\'s moves, Noble fled outside the ring where Nidia gave him some encouragement. The fight spread outside the ring and Noble threw his girlfriend into the challenger. Kidman tossed Nidia aside but was taken down with a modified arm bar. Noble continued to attack Kidman\'s injured arm back in the ring. Kidman\'s injured harm hampered his offense, but he continued to battle hard. Noble tried to put Kidman away with a powerbomb but the challenger countered into a facebuster. Kidman went to finish things with a Shooting Star Press, but Noble broke up the attempt. Kidman went for the Shooting Star Press again, but this time Noble just rolled out of harm\'s way. Noble flipped Kidman into a power bomb soon after and got the pin to retain his WWE Cruiserweight Championship! Winner: Jamie Noble<br /><br />Match 3: European Championship William Regal vs Jeff Hardy William Regal took on Jeff Hardy next in an attempt to win back the European Championship. Jeff catapulted Regal over the top rope then took him down with a hurracanrana off the ring apron. Back in the ring, Jeff hit the Whisper in the wind to knock Regal for a loop. Jeff went for the Swanton Bomb, but Regal got his knees up to hit Jeff with a devastating shot. Jeff managed to surprise Regal with a quick rollup though and got the pin to keep the European Championship! Regal started bawling at seeing Hardy celebrate on his way back up the ramp. 
Winner: Jeff Hardy<br /><br />Match 4: Chris Jericho vs John Cena Chris Jericho had promised to end John Cena\'s career in their match at Vengeance, which came up next. Jericho tried to teach Cena a lesson as their match began by suplexing him to the mat. Jericho continued to knock Cena around the ring until his cockiness got the better of him. While on the top rope, Jericho began to showboat and allowed Cena to grab him for a superplex! Cena followed with a tilt-a-whirl slam but was taken down with a nasty dropkick to the gut. The rookie recovered and hit a belly to belly suplex but couldn\'t put Y2J away. Jericho launched into the Lionsault but Cena dodged the move. Jericho nailed a bulldog and then connected on the Lionsault, but did not go for the cover. He goaded Cena to his feet so he could put on the Walls of Jericho. Cena had other ideas, reversing the move into a pin attempt and getting the 1-2-3! Jericho went berserk after the match. Winner: John Cena<br /><br />Match 5: Intercontinental Championship RVD vs Brock Lesnar via disqualification The Next Big Thing and Mr. Pay-Per-View tangled with the Intercontinental Championship on the line. Brock grabbed the title from the ref and draped it over his shoulder momentarily while glaring at RVD. Van Dam \'s quickness gave Brock fits early on. The big man rolled out of the ring and kicked the steel steps out of frustration. Brock pulled himself together and began to take charge. With Paul Heyman beaming at ringside, Brock slammed RVD to the hard floor outside the ring. From there, Brock began to overpower RVD, throwing him with ease over the top rope. RVD landed painfully on his back, then had to suffer from having his spine cracked against the steel ring steps. The fight returned to the ring with Brock squeezing RVD around the ribs. RVD broke away and soon after leveled Brock with a kick to the temple. RVD followed with the Rolling Thunder but Brock managed to kick out after a two-count. 
The fight looked like it might be over soon as RVD went for a Five-Star Frog Splash. Brock, though, hoisted Van Dam onto his shoulder and went for the F-5, but RVD whirled Brock into a DDT and followed with the Frog Splash! He went for the pin, but Heyman pulled the ref from the ring! The ref immediately called for a disqualification and soon traded blows with Heyman! After, RVD leapt onto Brock from the top rope and then threatened to hit the Van Terminator! Heyman grabbed RVD\'s leg and Brock picked up the champ and this time connected with the F-5 onto a steel chair! Winner: RVD<br /><br />Match 6: Booker T vs the Big Show Booker T faced the Big Show one-on-one next. Show withstood Booker T\'s kicks and punches and slapped Booker into the corner. After being thrown from the ring, Booker picked up a chair at ringside, but Big Show punched it back into Booker\'s face. Booker tried to get back into the game by choking Show with a camera cable at ringside. Booker smashed a TV monitor from the Spanish announcers\' position into Show\'s skull, then delivered a scissors kick that put both men through the table! Booker crawled back into the ring and Big Show staggered in moments later. Show grabbed Booker\'s throat but was met by a low blow and a kick to the face. Booker climbed the top rope and nailed a somersaulting leg drop to get the pin! Winner: Booker T<br /><br />Announcement: Triple H entered the ring to a thunderous ovation as fans hoped to learn where The Game would end up competing. Before he could speak, Eric Bishoff stopped The Game to apologize for getting involved in his personal business. If Triple H signed with RAW, Bischoff promised his personal life would never come into play again. Bischoff said he\'s spent the past two years networking in Hollywood. He said everyone was looking for the next breakout WWE Superstar, and they were all talking about Triple H. 
Bischoff guaranteed that if Triple H signed with RAW, he\'d be getting top opportunities coming his way. Stephanie McMahon stepped out to issue her own pitch. She said that because of her personal history with Triple H, the two of them know each other very well. She said the two of them were once unstoppable and they can be again. Bischoff cut her off and begged her to stop. Stephanie cited that Triple H once told her how Bischoff said Triple H had no talent and no charisma. Bischoff said he was young at the time and didn\'t know what he had, but he still has a lot more experience that Stephanie. The two continued to bicker back and forth, until Triple H stepped up with his microphone. The Game said it would be easy to say \\screw you\\" to either one of them. Triple H went to shake Bischoff\'s hand, but pulled it away. He said he would rather go with the devil he knows, rather than the one he doesn\'t know. Before he could go any further, though, Shawn Michaels came out to shake things up. HBK said the last thing he wanted to do was cause any trouble. He didn\'t want to get involved, but he remembered pledging to bring Triple H to the nWo. HBK said there\'s nobody in the world that Triple H is better friends with. HBK told his friend to imagine the two back together again, making Bischoff\'s life a living hell. Triple H said that was a tempting offer. He then turned and hugged HBK, making official his switch to RAW! Triple H and HBK left, and Bischoff gloated over his victory. Bischoff said the difference between the two of them is that he\'s got testicles and she doesn\'t. Stephanie whacked Bischoff on the side of the head and left!<br /><br />Match 7: Tag Team Championship Match Christian and Lance Storm vs Hollywood Hogan and Edge The match started with loud \\"USA\\" chants and with Hogan shoving Christian through the ropes and out of the ring. The Canadians took over from there. 
But Edge scored a kick to Christian\'s head and planted a facebuster on Storm to get the tag to Hogan. Hogan began to Hulk up and soon caught Christian with a big boot and a leg drop! Storm broke up the count and Christian tossed Hogan from the ring where Storm superkicked the icon. Edge tagged in soon after and dropped both opponents. He speared both of them into the corner turnbuckles, but missed a spear on Strom and hit the ref hard instead. Edge nailed a DDT, but the ref was down and could not count. Test raced down and took down Hogan then leveled Edge with a boot. Storm tried to get the pin, but Edge kicked out after two. Riksihi sprinted in to fend off Test, allowing Edge to recover and spear Storm. Christian distracted the ref, though, and Y2J dashed in and clocked Edge with the Tag Team Championship! Storm rolled over and got the pinfall to win the title! Winners and New Tag Team Champions: Christian and Lance Storm<br /><br />Match 8: WWE Undisputed Championship Triple Threat Match. The Rock vs Kurt Angle and the Undertaker Three of WWE\'s most successful superstars lined up against each other in a Triple Threat Match with the Undisputed Championship hanging in the balance. Taker and The Rock got face to face with Kurt Angle begging for some attention off to the side. He got attention in the form of a beat down form the two other men. Soon after, Taker spilled out of the ring and The Rock brawled with Angle. Angle gave a series of suplexes that took down Rock, but the Great One countered with a DDT that managed a two-count. The fight continued outside the ring with Taker coming to life and clotheslining Angle and repeatedly smacking The Rock. Taker and Rock got into it back into the ring, and Taker dropped The Rock with a sidewalk slam to get a two-count. Rock rebounded, grabbed Taker by the throat and chokeslammed him! Angle broke up the pin attempt that likely would have given The Rock the title. 
The Rock retaliated by latching on the ankle lock to Kurt Angle. Angle reversed the move and Rock Bottomed the People\'s Champion. Soon after, The Rock disposed of Angle and hit the People\'s Elbow on the Undertaker. Angle tried to take advantage by disabling the Great One outside the ring and covering Taker, who kicked out after a two count. Outside the ring, Rock took a big swig from a nearby water bottle and spewed the liquid into Taker\'s face to blind the champion. Taker didn\'t stay disabled for long, and managed to overpower Rock and turn his attention to Angle. Taker landed a guillotine leg drop onto Angle, laying on the ring apron. The Rock picked himself up just in time to break up a pin attempt on Kurt Angle. Taker nailed Rock with a DDT and set him up for a chokeslam. ANgle tried sneaking up with a steel chair, but Taker caught on to that tomfoolery and smacked it out of his hands. The referee got caught in the ensuing fire and didn\'t see Angle knock Taker silly with a steel chair. Angle went to cover Taker as The Rock lay prone, but the Dead Man somehow got his shoulder up. Angle tried to pin Rock, but he too kicked out. The Rock got up and landed Angle in the sharpshooter! Angle looked like he was about to tap, but Taker kicked The Rock out of the submission hold. Taker picked Rock up and crashed him with the Last Ride. While the Dead Man covered him for the win, Angle raced in and picked Taker up in the ankle lock! Taker went delirious with pain, but managed to counter. He picked Angle up for the last ride, but Angle put on a triangle choke! It looked like Taker was about to pass out, but The Rock broke Angle\'s hold only to find himself caught in the ankle lock. Rock got out of the hold and watched Taker chokeslam Angle. Rocky hit the Rock Bottom, but Taker refused to go down and kicked out. Angle whirled Taker up into the Angle Slam but was Rock Bottomed by the Great One and pinned! 
Winner and New WWE Champion: The Rock<br /><br />~Finally there is a decent PPV! Lately the PPV weren\'t very good, but this one was a winner. I give this PPV a A-<br /><br />"'

train.hist(column='length', by='sentiment', bins=100,figsize=(12,4))

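3.2 Text preprocessing

Remove HTML tags
Remove all non-letter characters
Convert to lowercase
Tokenize
Remove stop words

If the BeautifulSoup library is not installed, you can install it as follows: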

sudo pip install BeautifulSoup4

If the stopwords corpus is missing, run the following code:

import nltk
nltk.download("stopwords")

# Import the packages needed for preprocessing
from bs4 import BeautifulSoup
import re
import nltk
from nltk.corpus import stopwords

# Remove HTML tags
raw_text = BeautifulSoup(train["review"][0], "lxml").get_text()
print(raw_text)

With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? 
Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter.

# Remove all non-letter characters
letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
print(letters_only)

# Convert to lowercase
letters_lowercase = letters_only.lower()
print(letters_lowercase)

# Tokenize the document
words = letters_lowercase.split()
print(words)

# Remove stop words:
# 1. build the stop word set
# 2. filter the tokens against it
stops = set(stopwords.words("english"))
#stops.add("stuff")
clean_review1 = [w for w in words if w not in stops]
print(clean_review1)

def clean_text(raw_text):
    raw_text = BeautifulSoup(raw_text, "lxml").get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    return [w for w in words if w not in stops]

# Apply the cleaning to the review column
train['clean_review'] = train['review'].apply(clean_text)
#train['clean_review'] = [clean_text(e,stops)for e in train['review']]

# Add a column: the number of tokens left after cleaning
train['length_clean_review'] = train['clean_review'].apply(len)
train.head()

train.describe()

# Check the shortest review
print(train[train['length_clean_review'] == 4]['review'].iloc[0])
print('------After Cleaning------')
print(train[train['length_clean_review'] == 4]['clean_review'].iloc[0])

This movie is terrible but it has some good effects.
------After Cleaning------
['movie', 'terrible', 'good', 'effects']
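The five preprocessing steps can also be sketched without third-party packages. The version below is a simplification, not the method used in this walkthrough: a regular expression stands in for BeautifulSoup (it does not decode HTML entities), and `STOP_WORDS` is a tiny hand-picked set assumed here in place of NLTK's English list. The function name `clean_text_simple` is hypothetical.

```python
import re

# Assumption: a tiny hand-picked stop word set standing in for NLTK's English list
STOP_WORDS = {"this", "is", "but", "it", "has", "some", "a", "the", "and", "of"}

def clean_text_simple(raw_text: str) -> list:
    """Stdlib-only variant of the five preprocessing steps."""
    no_tags = re.sub(r"<[^>]+>", " ", raw_text)        # 1. drop HTML tags (regex shortcut)
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_tags)  # 2. keep letters only
    words = letters_only.lower().split()               # 3 and 4. lowercase, then tokenize
    return [w for w in words if w not in STOP_WORDS]   # 5. remove stop words

cleaned = clean_text_simple("This movie is terrible<br /><br />but it has some good effects.")
print(cleaned)  # ['movie', 'terrible', 'good', 'effects']
```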

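3.3 Visualizing with word clouds

# Word cloud of the raw reviews (join with spaces so adjacent reviews do not merge words)
from wordcloud import WordCloud
word_cloud = WordCloud(width = 1000, height = 500, background_color = 'black').generate(' '.join(train['review']))
plt.figure(figsize = (15,8))
plt.imshow(word_cloud)
plt.axis('off')
plt.show()

# Word cloud of the cleaned reviews. Each clean_review is a list of tokens, so the tokens
# are joined first; str(train['clean_review']) would only yield the truncated Series preview.
word_cloud = WordCloud(width = 1000, height = 500, background_color = 'black').generate(' '.join(' '.join(tokens) for tokens in train['clean_review']))
plt.figure(figsize = (15,8))
plt.imshow(word_cloud)
plt.axis('off')
plt.show()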
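A word cloud sizes each word by how often it occurs, so the underlying quantity is a word-frequency table. A minimal sketch with `collections.Counter`, using two made-up token lists in place of train['clean_review']:

```python
from collections import Counter

# Hypothetical cleaned reviews (lists of tokens), standing in for train['clean_review']
clean_reviews = [
    ["movie", "terrible", "good", "effects"],
    ["good", "movie", "great", "cast", "good", "story"],
]

# Count every token across all reviews
frequencies = Counter(token for review in clean_reviews for token in review)

# The most frequent tokens would be drawn largest in the cloud
print(frequencies.most_common(2))  # [('good', 3), ('movie', 2)]
```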

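4. Vectorization

We now need to convert the cleaned text into a form that machine learning models can work with. Here we use several different representations to vectorize the text. The process breaks down into the following four steps.

Compute boolean vectors
Compute TF
Compute IDF
Reduce the dimensionality of the text vectors

We then train models on boolean features, TF features, and TF-IDF features in turn.

4.1 Computing boolean feature vectors

4.1.1 In CountVectorizer we specify analyzer, max_features, and binary

from sklearn.feature_extraction.text import CountVectorizer

# This may take a while to run
bool_transformer = CountVectorizer(analyzer=clean_text, binary=True, max_features=5000).fit(train['review'])

# Print the vocabulary
# print(bool_transformer.vocabulary_)

# Print the first 100 entries
a = list(bool_transformer.vocabulary_.items())
print(a[:100])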

[('stuff', 4258), ('going', 1902), ('moment', 2859), ('started', 4173), ('listening', 2590), ('music', 2916), ('watching', 4838), ('odd', 3051), ('documentary', 1270), ('watched', 4836), ('maybe', 2747), ('want', 4814), ('get', 1872), ('certain', 674), ('insight', 2270), ('guy', 1978), ('thought', 4474), ('really', 3538), ('cool', 947), ('eighties', 1393), ('make', 2686), ('mind', 2819), ('whether', 4880), ('guilty', 1972), ('innocent', 2266), ('part', 3159), ('feature', 1646), ('film', 1687), ('remember', 3617), ('see', 3865), ('cinema', 757), ('originally', 3097), ('released', 3598), ('subtle', 4278), ('messages', 2789), ('feeling', 1653), ('towards', 4550), ('press', 3352), ('also', 146), ('obvious', 3041), ('message', 2788), ('drugs', 1332), ('bad', 330), ('visually', 4780), ('impressive', 2220), ('course', 977), ('michael', 2799), ('jackson', 2347), ('unless', 4687), ('remotely', 3624), ('like', 2569), ('anyway', 208), ('hate', 2026), ('find', 1697), ('boring', 485), ('may', 2746), ('call', 590), ('making', 2691), ('movie', 2895), ('fans', 1615), ('would', 4960), ('say', 3812), ('made', 2670), ('true', 4610), ('nice', 2979), ('actual', 54), ('bit', 431), ('finally', 1695), ('starts', 4175), ('minutes', 2830), ('smooth', 4059), ('criminal', 1019), ('sequence', 3899), ('joe', 2378), ('convincing', 943), ('powerful', 3329), ('drug', 1331), ('lord', 2625), ('wants', 4817), ('dead', 1092), ('beyond', 417), ('plans', 3260), ('character', 697), ('wanted', 4815), ('people', 3188), ('know', 2458), ('etc', 1486), ('hates', 2028), ('lots', 2635), ('things', 4465), ('turning', 4622), ('car', 617), ('robot', 3724), ('whole', 4884), ('speed', 4121), ('demon', 1137), ('director', 1225), ('must', 2920), ('patience', 3178), ('saint', 3789)]

Use the fitted bool_transformer to process a single review.

4.1.2 Inspecting the vector representation

review1 = train['review'][0]
bow1 = bool_transformer.transform([review1])
print(bow1)
print(bow1.toarray())
print(bow1.shape)
print(bow1.nnz)

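4.2 Computing TF feature vectors

4.2.1 In CountVectorizer we specify analyzer, max_features, and binary

from sklearn.feature_extraction.text import CountVectorizer

# This may take a while to run
bow_transformer = CountVectorizer(analyzer=clean_text, binary=False, max_features=5000).fit(train['review'])

4.2.2 In CountVectorizer we specify analyzer and max_features

# This may take a while to run (binary defaults to False, so the result matches 4.2.1)
bow_transformer = CountVectorizer(analyzer=clean_text, max_features=5000).fit(train['review'])

# Print the vocabulary size
print(len(bow_transformer.vocabulary_))

Use the fitted bow_transformer to process a single review.

4.2.3 Inspecting the vector representation

review1 = train['review'][0]
bow1 = bow_transformer.transform([review1])
print(bow1.toarray())
print(bow1)
print(bow1.shape)
print(bow1.nnz)

# Every column index corresponds to a word
# (scikit-learn 1.0+ offers get_feature_names_out(); get_feature_names() was removed in 1.2)
print(bow_transformer.get_feature_names()[480])
print(bow_transformer.get_feature_names()[4947])

# Transform the whole training set into a sparse matrix
review_bow = bow_transformer.transform(train['review'])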
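The difference between the boolean features of 4.1 and the TF features of 4.2 comes down to whether repeated words are counted. A plain-Python sketch of that idea, assuming a made-up four-word vocabulary; the `vectorize` helper is hypothetical and produces dense lists rather than the sparse matrices CountVectorizer returns.

```python
# Hypothetical toy vocabulary mapping word -> column index (in the spirit of vocabulary_)
vocabulary = {"bad": 0, "good": 1, "movie": 2, "plot": 3}

def vectorize(tokens, vocabulary, binary=False):
    """Return a dense count vector; with binary=True every non-zero count becomes 1."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        index = vocabulary.get(token)
        if index is not None:  # words outside the vocabulary are ignored
            vector[index] += 1
    if binary:
        vector = [1 if count > 0 else 0 for count in vector]
    return vector

tokens = ["good", "movie", "good", "cast", "good"]
tf_vector = vectorize(tokens, vocabulary)                 # [0, 3, 1, 0]
bool_vector = vectorize(tokens, vocabulary, binary=True)  # [0, 1, 1, 0]
nnz = sum(1 for value in tf_vector if value)              # non-zero entries, like .nnz

print(tf_vector, bool_vector, nnz)
```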

4.2.4 Checking the fraction of non-zero entries in the sparse matrix

print('Shape of Sparse Matrix: ', review_bow.shape)
print('Amount of Non-Zero occurences: ', review_bow.nnz)

Shape of Sparse Matrix: (25000, 5000)
Amount of Non-Zero occurences: 1980474

# Check the sparsity of the matrix (the fraction of entries that are non-zero)
sparsity = (review_bow.nnz / (review_bow.shape[0] * review_bow.shape[1]))
print('sparsity: {}'.format(sparsity))

sparsity: 0.015843792
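The reported figure can be checked directly from the shape and the non-zero count printed above. The second quantity, the average number of distinct vocabulary words per review, is an extra derived here for intuition and is not part of the original output.

```python
rows, cols = 25000, 5000  # shape of the sparse matrix reported above
nnz = 1980474             # number of non-zero entries reported above

density = nnz / (rows * cols)      # fraction of entries that are non-zero
avg_terms_per_review = nnz / rows  # distinct vocabulary words per review, on average

print(density)
print(round(avg_terms_per_review, 1))
```

Roughly 1.6% of the entries are non-zero, which is why a sparse matrix format is used.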

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer(norm="l2", smooth_idf=True).fit(review_bow)
tfidf1 = tfidf_transformer.transform(bow1)
print(tfidf1)
print(tfidf1.shape)
print(tfidf1.nnz)

4.2.5 Checking the computed IDF values

print(tfidf_transformer.idf_[bow_transformer.vocabulary_['well']])
print(tfidf_transformer.idf_[bow_transformer.vocabulary_['book']])

2.1996224552246644
3.8577509748566374

# Convert the BOW matrix to TF-IDF
review_tfidf = tfidf_transformer.transform(review_bow)
print(review_tfidf.shape)

(25000, 5000)
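With smooth_idf=True and norm="l2", scikit-learn documents the weighting as idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by scaling each row to unit Euclidean length. The sketch below applies that formula by hand to a made-up three-document corpus; the names `docs`, `vocab`, and `tfidf` are illustrative only.

```python
import math

# Hypothetical toy corpus of already-cleaned token lists
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
vocab = sorted({w for d in docs for w in d})  # ['bad', 'good', 'movie', 'plot']
n_docs = len(docs)

# Document frequency: in how many documents each word appears
df = {w: sum(1 for d in docs if w in d) for w in vocab}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf(doc):
    """Raw term counts times IDF, then scaled to unit L2 length."""
    weights = [doc.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm for x in weights]

print(idf)
print(tfidf(docs[2]))
```

Rarer words receive a larger IDF, which matches the output above, where 'book' (about 3.86) is weighted more heavily than the more common 'well' (about 2.20).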

5. Feature dimensionality reduction

5.1 Reducing the matrix with latent semantic analysis (LSA)

from sklearn.decomposition import TruncatedSVD
LSA = TruncatedSVD(n_components=300, n_iter=7, random_state=42)
LSA.fit(review_tfidf)
#print(LSA.explained_variance_ratio_)
#print(LSA.explained_variance_ratio_.sum())
#print(LSA.singular_values_)

TruncatedSVD(algorithm='randomized', n_components=300, n_iter=7,random_state=42, tol=0.0)
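To give a feel for what the reduction does, the sketch below finds a single leading singular direction of a tiny dense matrix and projects each row onto it. It uses plain power iteration, which is a swapped-in technique chosen for brevity; it is not the randomized algorithm that TruncatedSVD runs, and it recovers one component rather than 300. The matrix `A` and the helper name are made up.

```python
import math

def top_singular_triplet(matrix, n_iter=100):
    """Leading singular value and vectors via power iteration on A^T A."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # start from a unit vector
    for _ in range(n_iter):
        av = [sum(a * x for a, x in zip(row, v)) for row in matrix]  # A v
        atav = [sum(matrix[i][j] * av[i] for i in range(n_rows))
                for j in range(n_cols)]                              # A^T (A v)
        norm = math.sqrt(sum(x * x for x in atav))
        v = [x / norm for x in atav]
    av = [sum(a * x for a, x in zip(row, v)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in av))  # singular value equals |A v|
    u = [x / sigma for x in av]
    return sigma, u, v

# Hypothetical matrix: 3 documents x 2 terms
A = [[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
sigma, u, v = top_singular_triplet(A)

# Project every document onto the leading direction: 2 features become 1
reduced = [sum(a * x for a, x in zip(row, v)) for row in A]
print(round(sigma, 6), [round(x, 6) for x in reduced])
```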

6. Text modeling

6.1 Splitting the training set
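The code that creates X_train, X_test, y_train, and y_test does not appear in this write-up. The support counts in the reports further down (15,000 training rows and 10,000 held-out rows) are consistent with a 60/40 split of the 25,000 labeled reviews. The sketch below shows one way such a split could be produced with the standard library; the helper name, the 0.4 test fraction, and the seed are all assumptions, and scikit-learn's train_test_split would be the more usual tool.

```python
import random

def split_indices(n_samples, test_fraction=0.4, seed=42):
    """Shuffle the row indices and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded shuffle, so the split is repeatable
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(25000)
print(len(train_idx), len(test_idx))  # 15000 10000
```

The index lists can then be used to pick the matching reviews and labels, for example `[reviews[i] for i in train_idx]` for a hypothetical list `reviews`.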

from sklearn.metrics import classification_report

# Helper that evaluates a model's predictions against the true labels
def pred(predicted, compare):
    cm = pd.crosstab(compare, predicted)
    TN = cm.iloc[0,0]
    FN = cm.iloc[1,0]
    TP = cm.iloc[1,1]
    FP = cm.iloc[0,1]
    print("CONFUSION MATRIX ------->> ")
    print(cm)
    print()
    # Compute the model's accuracy and error rates
    print('Classification paradox :------->>')
    print('Accuracy :- ', round(((TP+TN)*100)/(TP+TN+FP+FN),2))
    print()
    print('False Negative Rate :- ',round((FN*100)/(FN+TP),2))
    print()
    print('False Postive Rate :- ',round((FP*100)/(FP+TN),2))
    print()
    print(classification_report(compare,predicted))

6.3 Training the models

6.3.1 Random forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD

tuned_parameters = {'n_estimators': [100,150], 'max_depth': [5,10]}
rfc = RandomForestClassifier(random_state=42)
# Note: scoring='r2' is a regression metric; 'accuracy' is the more usual choice for a classifier
pipeline_bool = Pipeline([
    ('bool', CountVectorizer(analyzer=clean_text, binary=True, max_features=5000)),
    ("LSA", TruncatedSVD(n_components=300, n_iter=7, random_state=42)),
    ("classifier", GridSearchCV(rfc, tuned_parameters, cv=5, scoring='r2', n_jobs=4, verbose=1))])
pipeline_bool.fit(X_train, y_train)
# Evaluate on the training data
predictions = pipeline_bool.predict(X_train)
pred(predictions, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits

[Parallel(n_jobs=4)]: Done 20 out of 20 | elapsed: 1.3min finished
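The "4 candidates" and "20 fits" in the log follow from the parameter grid: every combination of the listed values is one candidate, and each candidate is fitted once per cross-validation fold. A small sketch that enumerates the grid with `itertools.product`:

```python
from itertools import product

tuned_parameters = {"n_estimators": [100, 150], "max_depth": [5, 10]}
cv_folds = 5

# Every combination of the listed values is one candidate
names = list(tuned_parameters)
candidates = [dict(zip(names, values)) for values in product(*tuned_parameters.values())]

for candidate in candidates:
    print(candidate)

total_fits = len(candidates) * cv_folds
print(total_fits)  # 20
```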

CONFUSION MATRIX ------->>
col_0         0     1
sentiment
0          7074   360
1           257  7309

Classification paradox :------->>
Accuracy :- 95.89
False Negative Rate :- 3.4
False Postive Rate :- 4.84

             precision    recall  f1-score   support
          0       0.96      0.95      0.96      7434
          1       0.95      0.97      0.96      7566
avg / total       0.96      0.96      0.96     15000
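The three rates in this report follow directly from the four cells of the confusion matrix. Recomputing them by hand, with the counts copied from the output above (rows are true labels, columns are predictions):

```python
# Counts taken from the confusion matrix above
TN, FP = 7074, 360   # true label 0: predicted 0, predicted 1
FN, TP = 257, 7309   # true label 1: predicted 0, predicted 1

accuracy = round((TP + TN) * 100 / (TP + TN + FP + FN), 2)
false_negative_rate = round(FN * 100 / (FN + TP), 2)  # share of positives that were missed
false_positive_rate = round(FP * 100 / (FP + TN), 2)  # share of negatives flagged as positive

print(accuracy, false_negative_rate, false_positive_rate)  # 95.89 3.4 4.84
```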

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD

tuned_parameters = {'n_estimators': [100,150], 'max_depth': [5,10]}
rfc = RandomForestClassifier(random_state=42)
pipeline_tf = Pipeline([
    ('bow', CountVectorizer(analyzer=clean_text, max_features=5000)),
    ("LSA", TruncatedSVD(n_components=300, n_iter=7, random_state=42)),
    ("classifier", GridSearchCV(rfc, tuned_parameters, cv=5, scoring='r2', n_jobs=4, verbose=1))])
pipeline_tf.fit(X_train, y_train)
predictions = pipeline_tf.predict(X_train)
pred(predictions, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits

[Parallel(n_jobs=4)]: Done 20 out of 20 | elapsed: 1.3min finished

CONFUSION MATRIX ------->>
col_0         0     1
sentiment
0          7030   404
1           319  7247

Classification paradox :------->>
Accuracy :- 95.18
False Negative Rate :- 4.22
False Postive Rate :- 5.43

             precision    recall  f1-score   support
          0       0.96      0.95      0.95      7434
          1       0.95      0.96      0.95      7566
avg / total       0.95      0.95      0.95     15000

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD

tuned_parameters = {'n_estimators': [100,150], 'max_depth': [5,10]}
rfc = RandomForestClassifier(random_state=42)
pipeline_tfidf = Pipeline([
    ('bow', CountVectorizer(analyzer=clean_text, max_features=5000)),
    ('tfidf', TfidfTransformer()),
    ("LSA", TruncatedSVD(n_components=300, n_iter=7, random_state=42)),
    ("classifier", GridSearchCV(rfc, tuned_parameters, cv=5, scoring='r2', n_jobs=4, verbose=1))])
pipeline_tfidf.fit(X_train, y_train)
predictions = pipeline_tfidf.predict(X_train)
pred(predictions, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits

[Parallel(n_jobs=4)]: Done 20 out of 20 | elapsed: 1.3min finished

CONFUSION MATRIX ------->>
col_0         0     1
sentiment
0          6900   534
1           389  7177

Classification paradox :------->>
Accuracy :- 93.85
False Negative Rate :- 5.14
False Postive Rate :- 7.18

             precision    recall  f1-score   support
          0       0.95      0.93      0.94      7434
          1       0.93      0.95      0.94      7566
avg / total       0.94      0.94      0.94     15000

# Evaluate on the held-out test split
predictions = pipeline_bool.predict(X_test)
pred(predictions, y_test)

CONFUSION MATRIX ------->>
col_0         0     1
sentiment
0          3810  1256
1           915  4019

Classification paradox :------->>
Accuracy :- 78.29
False Negative Rate :- 18.54
False Postive Rate :- 24.79

             precision    recall  f1-score   support
          0       0.81      0.75      0.78      5066
          1       0.76      0.81      0.79      4934
avg / total       0.78      0.78      0.78     10000

# Evaluate on the held-out test split
predictions = pipeline_tf.predict(X_test)
pred(predictions, y_test)

CONFUSION MATRIX ------->>
col_0         0     1
sentiment
0          3690  1376
1          1000  3934

Classification paradox :------->>
Accuracy :- 76.24
False Negative Rate :- 20.27
False Postive Rate :- 27.16

             precision    recall  f1-score   support
          0       0.79      0.73      0.76      5066
          1       0.74      0.80      0.77      4934
avg / total       0.76      0.76      0.76     10000

# Evaluate on the held-out test split
predictions = pipeline_tfidf.predict(X_test)
pred(predictions, y_test)

CONFUSION MATRIX ------->>
col_0         0     1
sentiment
0          3910  1156
1           897  4037

Classification paradox :------->>
Accuracy :- 79.47
False Negative Rate :- 18.18
False Postive Rate :- 22.82

             precision    recall  f1-score   support
          0       0.81      0.77      0.79      5066
          1       0.78      0.82      0.80      4934
avg / total       0.80      0.79      0.79     10000
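Putting the six accuracy figures side by side makes the comparison easier. The sketch below copies the numbers from the reports above, computes the gap between training and held-out accuracy for each feature type, and picks the variant with the highest held-out accuracy. The dictionary layout and the variable names are illustrative.

```python
# Accuracies (in percent) as printed in the reports above
results = {
    "bool":   {"train": 95.89, "test": 78.29},
    "tf":     {"train": 95.18, "test": 76.24},
    "tf-idf": {"train": 93.85, "test": 79.47},
}

gaps = {}
for name, acc in results.items():
    gaps[name] = round(acc["train"] - acc["test"], 2)  # a large gap points to overfitting
    print(name, acc["test"], gaps[name])

best = max(results, key=lambda name: results[name]["test"])
print(best)  # tf-idf
```

All three pipelines score 14 to 19 points lower on held-out data than on the data they were trained on, so the training-set figures should be read as optimistic. The TF-IDF pipeline has both the smallest gap and the highest held-out accuracy.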

Finally, we apply the best model we obtained to predict labels for the unlabeled dataset:

test['sentiment'] = pipeline_tfidf.predict(test['review'])
output = test[['id','sentiment']]
print(output)

             id  sentiment
0      12311_10          1
1        8348_2          0
2        5828_4          1
3        7186_2          0
4       12128_7          1
5        2913_8          1
6        4396_1          0
7         395_2          0
8       10616_1          0
9        9074_9          1
10       9252_3          0
11       9896_9          0
12        574_4          1
13      11182_8          1
14      11656_4          0
15       2322_4          1
16       8703_1          1
17       7483_1          1
18      6007_10          1
19      12424_4          0
20       4672_1          0
21      10841_3          0
22       8954_7          0
23       7392_1          0
24      10288_8          1
25       5343_4          0
26       4950_1          0
27       9257_4          0
28       8689_3          0
29       4480_2          1
...         ...        ...
24970   6857_10          1
24971   11091_8          1
24972    4167_2          1
24973     679_4          1
24974   10147_1          0
24975    6875_1          0
24976    923_10          1
24977    6200_8          0
24978    7208_8          1
24979    5363_8          1
24980    4067_8          0
24981    1773_7          1
24982   1498_10          1
24983  10497_10          0
24984   3444_10          1
24985     588_2          0
24986    9678_9          1
24987    1983_9          0
24988    5012_3          1
24989   12240_2          1
24990    5071_2          0
24991    5078_2          0
24992   10069_3          0
24993    7407_8          1
24994    7207_1          0
24995   2155_10          1
24996     59_10          1
24997    2531_1          1
24998    7772_8          1
24999  11465_10          1

[25000 rows x 2 columns]
