100字范文 > python基因差异分析_TCGA的差异基因分析

python基因差异分析_TCGA的差异基因分析

时间：2019-11-06 21:37:38

在分析TCGA数据库里的RNA-seq数据之前，先了解一下TCGA样本的id名称里的小秘密：参考文章：TCGA的样本id里藏着分组信息

根据文章里的内容，我查看了前一篇文章里我下载的count文件(利用R包TCGAbiolinks进行各种数据下载)，打开是这样的：

列是样品名称，格式举例：TCGA.BA.A4IF.01A.11R.A266.07，这里面包含分组信息，比如说这个样品是肿瘤样品还是正常组织。分组信息是在这个id的第14-15位，01-09是tumor，10-29是normal。比如第一个样品，是01A，说明这个样品是个肿瘤样品。

OK,知道了哪里是分组信息，就可以开始进行操作了。

(一)准备R包

if(!require(stringr))install.packages('stringr')

if(!require(ggplotify))install.packages("ggplotify")

if(!require(patchwork))install.packages("patchwork")

if(!require(cowplot))install.packages("cowplot")

if(!require(DESeq2))BiocManager::install('DESeq2')

if(!require(edgeR))BiocManager::install('edgeR')

if(!require(limma))BiocManager::install('limma')

(二)载入数据

这里的数据是我上一篇文章下载好的count文件，不知道怎么下的请移步：利用R包TCGAbiolinks进行各种数据下载

#加载数据

> rm(list = ls())

> a= read.csv("TCGAbiolinks_HNSC_counts.csv")

> dim(a)

[1] 56512 547

#另外要关注一下，你的这个表达矩阵的第一列是样品还是基因名，如果第一列是基因名，要把第一列设置成为行名，不然只有的差异分析会出错

#将第一列换成行名

> row.names(a)

> a

NOTE：如果你没有check你的矩阵，导致了你的矩阵第一列是基因名，后面DESeq2运行会显示：“Error in DESeqDataSet(se, design = design, ignoreRank) : some values in assay are negative”这样的报错。

(三)将样品分组

> group_list

> table(group_list)

group_list

normal tumor

44 502

(四)差异分析

DESeq2方法做差异分析

> library(DESeq2)

> colData

condition=group_list) #condition是你的分组信息

> dds

countData = a,

colData = colData,

design = ~ condition)

> dds

estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

-- replacing outliers and refitting for 3594 genes

-- DESeq argument 'minReplicatesForReplace' = 7

-- original counts are preserved in counts(dds)

estimating dispersions

fitting model and testing

# 两两比较

> res

> resOrdered

> DEG

> head(DEG)

baseMean log2FoldChange lfcSE stat pvalue padj

ENSG00000231887 233.5976 -7.414328 0.3208937 -23.10525 4.100424e-118 1.594778e-113

ENSG00000107159 1962.6861 5.849713 0.2969441 19.69971 2.168630e-86 4.217225e-82

ENSG00000106351 1273.1916 -2.371015 0.1206022 -19.65980 4.765762e-86 6.178493e-82

ENSG00000070081 2922. -2.145545 0.1094426 -19.60430 1.420977e-85 1.381652e-81

ENSG00000102547 434.7033 -2.210238 0.1154164 -19.15013 9.655057e-82 7.510283e-78

ENSG00000151882 882.7880 -4.178352 0.2216513 -18.85102 2.882343e-79 1.868383e-75

去除NA值:

> DEG

在DEG里添加一列名为change列，标记基因上调下调:

> logFC_cutoff

[1] 2.610411

> DEG$change = as.factor(

+ ifelse(DEG$pvalue < 0.05 & abs(DEG$log2FoldChange) > logFC_cutoff,

+ ifelse(DEG$log2FoldChange > logFC_cutoff ,'UP','DOWN'),'NOT')

+ )

> head(DEG)

baseMean log2FoldChange lfcSE stat pvalue padj change

ENSG00000231887 233.5976 -7.414328 0.3208937 -23.10525 4.100424e-118 1.594778e-113 DOWN

ENSG00000107159 1962.6861 5.849713 0.2969441 19.69971 2.168630e-86 4.217225e-82 UP

ENSG00000106351 1273.1916 -2.371015 0.1206022 -19.65980 4.765762e-86 6.178493e-82 NOT

ENSG00000070081 2922. -2.145545 0.1094426 -19.60430 1.420977e-85 1.381652e-81 NOT

ENSG00000102547 434.7033 -2.210238 0.1154164 -19.15013 9.655057e-82 7.510283e-78 NOT

ENSG00000151882 882.7880 -4.178352 0.2216513 -18.85102 2.882343e-79 1.868383e-75 DOWN

> View(DEG)

> DESeq2_DEG

最后看一下用这个方法做差异分析，上调和下调的基因有多少：

> table(DEG$change)

DOWN NOT UP

929 36955 1009

edgeR方法做差异分析

> library(edgeR)

> dge

> dge$samples$lib.size

> dge

> design

> rownames(design)

> colnames(design)

> dge

> fit

> fit2

> DEG2=topTags(fit2, n=nrow(a))

> DEG2=as.data.frame(DEG2)

> logFC_cutoff2

> DEG2$change = as.factor(

+ ifelse(DEG2$PValue < 0.05 & abs(DEG2$logFC) > logFC_cutoff2,

+ ifelse(DEG2$logFC > logFC_cutoff2 ,'UP','DOWN'),'NOT')

+ )

> head(DEG2)

logFC logCPM LR PValue FDR change

ENSG00000170369 -9.842798 4.4837048 1227.908 5.248913e-269 2.966266e-264 DOWN

ENSG00000162877 -6.470581 -0.1480235 1215.091 3.203213e-266 9.050998e-262 DOWN

ENSG00000231887 -9.129113 4.9924958 1140.670 4.781111e-250 9.006338e-246 DOWN

ENSG00000131686 -9.939181 4.2888099 1135.359 6.822362e-249 9.638633e-245 DOWN

ENSG00000101441 -10.903794 5.4644612 1059.704 1.893018e-232 2.139565e-228 DOWN

ENSG00000160349 -10.982631 6.4426001 1030.047 5.287725e-226 4.980332e-222 DOWN

看一下用edgeR方法，上调和下调基因的数量：

> table(DEG2$change)

DOWN NOT UP

1188 53914 1410

> edgeR_DEG

limma-voom方法做差异分析

> library(limma)

> design

> colnames(design)=levels(group_list)

> rownames(design)=colnames(a)

> dge

> logCPM

> v

> fit

> constrasts = paste(rev(levels(group_list)),collapse = "-")

> cont.matrix

> fit3=contrasts.fit(fit,cont.matrix)

> fit3=eBayes(fit3)

> DEG3 = topTable(fit3, coef=constrasts, n=Inf)

> DEG3 = na.omit(DEG3)

> logFC_cutoff3

> DEG3$change = as.factor(

+ ifelse(DEG3$P.Value < 0.05 & abs(DEG3$logFC) > logFC_cutoff3,

+ ifelse(DEG3$logFC > logFC_cutoff3 ,'UP','DOWN'),'NOT')

+ )

> head(DEG3)

logFC AveExpr t P.Value adj.P.Val B change

ENSG00000203740 3.394627 -3.404611 28.73533 3.090631e-111 1.746578e-106 243.1880 UP

ENSG00000096006 -8.588036 -1.083975 -27.14114 2.835110e-103 8.010887e-99 224.8165 DOWN

ENSG00000260976 3.045123 -3.341040 27.00751 1.329154e-102 2.503772e-98 223.4002 UP

ENSG00000181092 -7.489810 -4.776791 -24.73134 4.130357e-91 5.835368e-87 196.2190 DOWN

ENSG00000198478 -4.043231 3.231204 -24.35482 3.353473e-89 3.790229e-85 192.7091 DOWN

ENSG00000181355 4.163352 -2.440333 24.29635 6.639596e-89 6.253614e-85 191.7566 UP

> limma_voom_DEG

> table(DEG3$change)

DOWN NOT UP

1593 53742 1177

(五)保存三种方法得到的矩阵

> save(DESeq2_DEG,edgeR_DEG,limma_voom_DEG,group_list,file = "DEG.Rdata")

(六)热图

下面得到了三个方法得来了矩阵，就可以可视化了。教程里用的包draw_heatmap在我的R版本上安装不了，所以我用的是pheatmap画的热图。

#这一步是定义后面画的热图是三个矩阵里change那一列上调和下调的基因，不包括没变化的基因

> cg1 = rownames(DESeq2_DEG)[DESeq2_DEG$change !="NOT"]

> cg2 = rownames(edgeR_DEG)[edgeR_DEG$change !="NOT"]

> cg3 = rownames(limma_voom_DEG)[limma_voom_DEG$change !="NOT"]

> library(pheatmap)

> library(RColorBrewer)

#定义热图的颜色

> color

#下面画第一个DESeq2矩阵的热图

> mat=a[cg1,]

> n=t(scale(t(mat)))

> n[n>1]=1

> n[n< -1]= -1

> ac=data.frame(group=group_list)

> rownames(ac)=colnames(mat)

> ht1

cluster_rows = F,cluster_cols = T,

annotation_col = ac,color=color)

#下面画edgeR矩阵的热图

> mat2=a[cg2,]

> n2=t(scale(t(mat2)))

> n2[n2 > 1]=1

> n2[n2< -1]= -1

> ht2

cluster_rows = F,cluster_cols = T,

annotation_col = ac,color=color)

#下面画limma矩阵的热图

> mat3=a[cg3,]

> n3=t(scale(t(mat3)))

> n3[n3 > 1]=1

> n3[n3< -1]= -1

> ht3

cluster_rows = F,cluster_cols = T,

annotation_col = ac,color=color)

DESeq2矩阵热图

edgeR矩阵热图

limma矩阵热图

(七)火山图

> library(EnhancedVolcano)

> library(airway)

> v1=EnhancedVolcano(DESeq2_DEG,

lab = rownames(DESeq2_DEG),

x = 'log2FoldChange',

y = 'pvalue',

xlim = c(-8, 8),

title = 'DESeq2_DEG',

pCutoff = 10e-17,

FCcutoff = 2.5,

transcriptPointSize = 1.0,

transcriptLabSize = 3.0,

col=c('black', 'blue', 'green', 'red1'),

colAlpha = 1,

legend=c('NS','Log2 fold-change','P value',

'P value & Log2 fold-change'),

legendPosition = 'right',

legendLabSize = 10,

legendIconSize = 3.0,

)

> v2=EnhancedVolcano(edgeR_DEG,

lab = rownames(edgeR_DEG),

x = 'logFC',

y = 'PValue',

xlim = c(-8, 8),

title = 'edgeR_DEG',

pCutoff = 10e-17,

FCcutoff = 2.5,

transcriptPointSize = 1.0,

transcriptLabSize = 3.0,

col=c('black', 'blue', 'green', 'red1'),

colAlpha = 1,

legend=c('NS','LogFC','P value',

'P value & LogFC'),

legendPosition = 'right',

legendLabSize = 10,

legendIconSize = 3.0,

)

> v3=EnhancedVolcano(limma_voom_DEG,

lab = rownames(limma_voom_DEG),

x = 'logFC',

y = 'P.Value',

xlim = c(-8, 8),

title = 'limma_voom_DEG',

pCutoff = 10e-17,

FCcutoff = 2.5,

transcriptPointSize = 1.0,

transcriptLabSize = 3.0,

col=c('black', 'blue', 'green', 'red1'),

colAlpha = 1,

legend=c('NS','LogFC','P value',

'P value & LogFC'),

legendPosition = 'right',

legendLabSize = 10,

legendIconSize = 3.0,

)

把三个火山图组合起来：

#multiplot function

> multiplot

require(grid)

# Make a list from the ... arguments and plotlist

plots

numPlots = length(plots)

# If layout is NULL, then use 'cols' to determine layout

if (is.null(layout)) {

# Make the panel

# ncol: Number of columns of plots

# nrow: Number of rows needed, calculated from # of cols

layout

ncol = cols, nrow = ceiling(numPlots/cols))

}

if (numPlots==1) {

print(plots[[1]])

} else {

# Set up the page

grid.newpage()

pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

# Make each plot, in the correct location

for (i in 1:numPlots) {

# Get the i,j matrix positions of the regions that contain this subplot

matchidx

print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,

layout.pos.col = matchidx$col))

}

> multiplot(v1,v2,v3,cols=3)

(八)三个不同方法得到的矩阵的交集可视化

上面的可视化是用三个矩阵分别画出的热图和火山图，下面要把这三个矩阵里上调和下调的基因的交集提出来，然后再进行可视化：

> UP=function(df){

rownames(df)[df$change=="UP"]

}

> DOWN=function(df){

rownames(df)[df$change=="DOWN"]

}

> up = intersect(intersect(UP(DESeq2_DEG),UP(edgeR_DEG)),UP(limma_voom_DEG))

> down = intersect(intersect(DOWN(DESeq2_DEG),DOWN(edgeR_DEG)),DOWN(limma_voom_DEG))

> mat_total=a[c(up,down),]

> n4=t(scale(t(mat_total)))

> n4[n4 >1]=1

> n4[n4< -1]= -1

> ac=data.frame(group=group_list)

> rownames(ac)=colnames(mat_total)

> ht_combine

cluster_rows = F,cluster_cols = T,

annotation_col = ac,color=color)

三个矩阵的交集画韦恩图：

#上调、下调基因分别画维恩图

> up.plot

> down.plot

三个矩阵里上调基因的交集，一共533个

三个矩阵里下调基因的交集，一共630个

本内容不代表本网观点和政治立场，如有侵犯你的权益请联系我们处理。

网友评论

网友评论仅供其表达个人看法，并不表明网站立场。