100字范文 > 高通量测序分析工具Bedtools使用介绍

高通量测序分析工具Bedtools使用介绍

时间：2020-04-17 02:42:02

//04/bedtools/

Bedtools是处理基因组信息分析的强大工具集合，其主要功能如下：

bedtools: flexible tools for genome arithmetic and DNA sequence analysis.usage: bedtools <subcommand> [options]The bedtools sub-commands include:[ Genome arithmetic ]intersectFind overlapping intervals in various ways.求区域之间的交集，可以用来注释peak，计算reads比对到的基因组区域不同样品的peak之间的peak重叠情况。window Find overlapping intervals within a window around an interval.closest Find the closest, potentially non-overlapping interval.寻找最近但可能不重叠的区域coverageCompute the coverage over defined intervals.计算区域覆盖度map Apply a function to a column for each overlapping interval.genomecovCompute the coverage over an entire genome.merge Combine overlapping/nearby intervals into a single interval.合并重叠或相接的区域cluster Cluster (but don't merge) overlapping/nearby plement Extract intervals _not_ represented by an interval file.获得互补区域subtractRemove intervals based on overlaps b/w two files.计算区域差集slopAdjust the size of intervals.调整区域大小，如获得转录起始位点上下游3 K的区域flank Create new intervals from the flanks of existing intervals.sortOrder the intervals in a file.排序，部分命令需要排序过的bed文件random Generate random intervals in a genome.获得随机区域，作为背景集shuffle Randomly redistrubute intervals in a genome.根据给定的bed文件获得随机区域，作为背景集sample Sample random records from file using reservoir sampling.spacing Report the gap lengths between intervals in a file.annotateAnnotate coverage of features from multiple files.[ Multi-way file comparisons ]multiinter Identifies common intervals among multiple interval files.unionbedgCombines coverage intervals from multiple BEDGRAPH files.[ Paired-end manipulation ]pairtobedFind pairs that overlap intervals in various ways.pairtopair Find pairs that overlap other pairs in various ways.[ Format conversion ]bamtobedConvert BAM alignments to BED (& other) formats.bedtobamConvert intervals to BAM records.bamtofastq Convert BAM records to FASTQ records.bedpetobam Convert BEDPE intervals to BAM records.bed12tobed6 Breaks BED12 intervals into discrete BED6 intervals.[ Fasta manipulation ]getfastaUse intervals to extract sequences from a FASTA file.提取给定位置的FASTA序列maskfastaUse intervals to mask sequences from a FASTA file.nuc Profile the nucleotide content of intervals in a FASTA file.[ BAM focused tools ]multicovCounts coverage from multiple BAMs at specific intervals.tag Tag BAM alignments based on overlaps with interval files.[ Statistical relationships ]jaccard Calculate the Jaccard statistic b/w two sets of intervals.计算数据集相似性reldist Calculate the distribution of relative distances b/w two files.fisher Calculate Fisher statistic b/w two feature files.[ Miscellaneous tools ]overlap Computes the amount of overlap from two intervals.igv Create an IGV snapshot batch script.用于生成一个脚本，批量捕获IGV截图links Create a HTML page of links to UCSC locations.makewindows Make interval "windows" across a genome.把给定区域划分成指定大小和间隔的小区间 (bin)groupby Group by common cols. & summarize oth. cols. (~ SQL "groupBy")分组结算，不只可以用于bed文件。expand Replicate lines based on lists of values in columns.split Split a file into multiple files with equal records or base pairs.

安装bedtools

（Linux - Conda软件安装方法）

ct@ehbio:~$ conda install bedtools

获得测试数据集(/tutorials/bedtools/bedtools.html)

ct@ehbio:~$ mkdir bedtoolsct@ehbio:~$ cd bedtoolsct@ehbio:~$ url=/bedtools-tutorials/webct@ehbio:~/bedtools$ curl -O ${url}/maurano.dnaseI.tgzct@ehbio:~/bedtools$ curl -O ${url}/cpg.bedct@ehbio:~/bedtools$ curl -O ${url}/exons.bedct@ehbio:~/bedtools$ curl -O ${url}/gwas.bedct@ehbio:~/bedtools$ curl -O ${url}/genome.txtct@ehbio:~/bedtools$ curl -O ${url}/hesc.chromHmm.bed

交集 (intersect)

查看输入文件，bed格式，至少三列，分别是染色体，起始位置(0-based,

包括)，终止位置

(1-based，不包括)。第四列一般为区域名字，第五列一般为空，第六列为链的信息。更详细解释见http://www.genome.ucsc.edu/FAQ/FAQformat.html#format1。

自己做研究CpG岛信息可以从UCSC的Table Browser获得，具体操作见//05/ucsc-usages/。

ct@ehbio:~/bedtools$ head -n 3 cpg.bed exons.bed==> cpg.bed <==chr1 28735 29810 CpG:_116chr1 135124 135563 CpG:_30chr1 327790 328229 CpG:_29==> exons.bed <==chr1 11873 12227 NR_046018_exon_0_0_chr1_11874_f 0 +chr1 12612 12721 NR_046018_exon_1_0_chr1_12613_f 0 +chr1 13220 14409 NR_046018_exon_2_0_chr1_13221_f 0 +

获得重叠区域(既是外显子，又是CpG岛的区域)

ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed | head -5chr1 29320 29370 CpG:_116chr1 135124 135563 CpG:_30chr1 327790 328229 CpG:_29chr1 327790 328229 CpG:_29chr1 327790 328229 CpG:_29

输出重叠区域对应的原始区域(与外显子存在交集的CpG岛)

ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed -wa -wb > | head -5

chr1 28735 29810 CpG:_116 chr1 29320 29370

NR_024540_exon_10_0_chr1_29321_r 0 -

chr1 135124 135563 CpG:_30 chr1 134772 139696

NR_039983_exon_0_0_chr1_134773_r 0 -

chr1 327790 328229 CpG:_29 chr1 324438 328581

NR_028322_exon_2_0_chr1_324439_f 0 +

chr1 327790 328229 CpG:_29 chr1 324438 328581

NR_028325_exon_2_0_chr1_324439_f 0 +

chr1 327790 328229 CpG:_29 chr1 327035 328581

NR_028327_exon_3_0_chr1_327036_f 0 +

计算重叠碱基数

ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed -wo | head -10

chr1 28735 29810 CpG:_116 chr1 29320 29370

NR_024540_exon_10_0_chr1_29321_r 0 - 50

chr1 135124 135563 CpG:_30 chr1 134772 139696

NR_039983_exon_0_0_chr1_134773_r 0 - 439

chr1 327790 328229 CpG:_29 chr1 324438 328581

NR_028322_exon_2_0_chr1_324439_f 0 + 439

chr1 327790 328229 CpG:_29 chr1 324438 328581

NR_028325_exon_2_0_chr1_324439_f 0 + 439

chr1 327790 328229 CpG:_29 chr1 327035 328581

NR_028327_exon_3_0_chr1_327036_f 0 + 439

chr1 713984 714547 CpG:_60 chr1 713663 714068

NR_033908_exon_6_0_chr1_713664_r 0 - 84

chr1 762416 763445 CpG:_115 chr1 761585 762902

NR_024321_exon_0_0_chr1_761586_r 0 - 486

chr1 762416 763445 CpG:_115 chr1 762970 763155

NR_015368_exon_0_0_chr1_762971_f 0 + 185

chr1 762416 763445 CpG:_115 chr1 762970 763155

NR_047519_exon_0_0_chr1_762971_f 0 + 185

chr1 762416 763445 CpG:_115 chr1 762970 763155

NR_047520_exon_0_0_chr1_762971_f 0 + 185

计算第一个(-a)bed区域有多少个重叠的第二个(-b)bed文件中有多少个区域

ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed -c | headchr1 28735 29810 CpG:_116 1chr1 135124 135563 CpG:_30 1chr1 327790 328229 CpG:_29 3chr1 437151 438164 CpG:_84 0chr1 449273 450544 CpG:_99 0chr1 533219 534114 CpG:_94 0chr1 544738 546649 CpG:_171 0chr1 713984 714547 CpG:_60 1chr1 762416 763445 CpG:_115 10chr1 788863 789211 CpG:_28 9

另外还有-v取出不重叠的区域,

-f限定重叠最小比例，-sorted可以对按sort -k1,1 -k2,2n排序好的文件加速操作。

同时对多个区域求交集 (可以用于peak的多维注释)

# -names标注注释来源# -sorted: 如果使用了这个参数，提供的一定是排序好的bed文件ct@ehbio:~/bedtools$ bedtools intersect -a exons.bed \-b cpg.bed gwas.bed hesc.chromHmm.bed -sorted -wa -wb -names cpg gwas chromhmm \| head -10000 | tail -10

chr1 27632676 27635124

NM_001276252_exon_15_0_chr1_27632677_chromhmm chr1 27633213

27635013 5_Strong_Enhancer

chr1 27632676 27635124

NM_001276252_exon_15_0_chr1_27632677_chromhmm chr1 27635013

27635413 7_Weak_Enhancer

chr1 27632676 27635124 NM_015023_exon_15_0_chr1_27632677_f

chromhmm chr1 27632613 27632813 6_Weak_Enhancer

chr1 27632676 27635124 NM_015023_exon_15_0_chr1_27632677_f

chromhmm chr1 27632813 27633213 7_Weak_Enhancer

chr1 27632676 27635124 NM_015023_exon_15_0_chr1_27632677_f

chromhmm chr1 27633213 27635013 5_Strong_Enhancer

chr1 27632676 27635124 NM_015023_exon_15_0_chr1_27632677_f

chromhmm chr1 27635013 27635413 7_Weak_Enhancer

chr1 27648635 27648882 NM_032125_exon_0_0_chr1_27648636_f cpg

chr1 27648453 27649006 CpG:_63

chr1 27648635 27648882 NM_032125_exon_0_0_chr1_27648636_f

chromhmm chr1 27648613 27649413 1_Active_Promoter

chr1 27648635 27648882 NR_037576_exon_0_0_chr1_27648636_f cpg

chr1 27648453 27649006 CpG:_63

chr1 27648635 27648882 NR_037576_exon_0_0_chr1_27648636_f

chromhmm chr1 27648613 27649413 1_Active_Promoter

合并区域

bedtools merge输入的是按sort -k1,1 -k2,2n排序好的bed文件。

只需要输入一个排序好的bed文件，默认合并重叠或邻接区域。

ct@ehbio:~/bedtools$ bedtools merge -i exons.bed | head -n 5chr1 11873 12227chr1 12612 12721chr1 13220 14829chr1 14969 15038chr1 15795 15947

合并区域并输出此合并后区域是由几个区域合并来的

ct@ehbio:~/bedtools$ bedtools merge -i exons.bed -c 1 -o count | head -n 5chr1 11873 12227 1chr1 12612 12721 1chr1 13220 14829 2chr1 14969 15038 1chr1 15795 15947 1

合并相距90 nt内的区域，并输出是由哪些区域合并来的

# -c: 指定对哪些列进行操作# -o: 与-c对应，表示对指定列进行哪些操作# 这里的用法是对第一列做计数操作，输出这个区域是由几个区域合并来的# 对第4列做收集操作，记录合并的区域的名字，并逗号分隔显示出来ct@ehbio:~/bedtools$ bedtools merge -i exons.bed -d 340 -c 1,4 -o count,collapse | head -4chr1 11873 12227 1 NR_046018_exon_0_0_chr1_11874_fchr1 12612 12721 1 NR_046018_exon_1_0_chr1_12613_fchr1 13220 15038 3 NR_046018_exon_2_0_chr1_13221_f,NR_024540_exon_0_0_chr1_14362_r,NR_024540_exon_1_0_chr1_14970_rchr1 15795 15947 1 NR_024540_exon_2_0_chr1_15796_r

计算互补区域

给定一个全集，再给定一个子集，求另一个子集。比如给定每条染色体长度和外显子区域，求非外显子区域。给定基因区，求非基因区。给定重复序列，求非重复序列等。

重复序列区域的获取也可以用上面提供的链接

//05/ucsc-usages/。

ct@ehbio:~/bedtools$ head genome.txt chr1 249250621chr10 135534747chr11 135006516chr11_gl000202_random 40103chr12 133851895chr13 115169878chr14 107349540chr15 102531392ct@ehbio:~/bedtools$ bedtools complement -i exons.bed -g genome.txt | head -n 5chr1 0 11873chr1 12227 12612chr1 12721 13220chr1 14829 14969chr1 15038 15795

基因组覆盖广度和深度

计算基因组某个区域是否被覆盖，覆盖深度多少。有下图多种输出格式，也支持RNA-seq数据，计算junction-reads覆盖。

genome.txt里面的内容就是染色体及对应的长度。

# 对单行FASTA，可如此计算# 如果是多行FASTA，则需要累加ct@ehbio:~/bedtools$ awk 'BEGIN{OFS=FS="\t"}{\if($0~/>/) {seq_name=$0;sub(">","",seq_name);} \else {print seq_name,length;} }' ../bio/genome.fa | tee ../bio/genome.txt chr1 60001chr2 54001chr3 54001chr4 60001ct@ehbio:~/bedtools$ bedtools genomecov -ibam ../bio/map.sortP.bam -bga \-g ../bio/genome.txt | head# 这个warning很有意思，因为BAM中已经有这个信息了，就不需要提供了**********WARNING: Genome (-g) files are ignored when BAM input is provided. *****# bedgraph文件，前3列与bed相同，最后一列表示前3列指定的区域的覆盖度。chr1 0 11 0chr1 11 17 1chr1 17 20 2chr1 20 31 3chr1 31 36 4chr1 36 43 6chr1 43 44 7chr1 44 46 8chr1 46 48 9chr1 48 54 10

两个思考题：

怎么计算有多少基因组区域被测到了？
怎么计算平均测序深度是多少？

数据集相似性

bedtools jaccard计算的是给定的两个bed文件之间交集区域(intersection)占总区域(union-intersection)的比例(jaccard)和交集的数目(n_intersections)。

ct@ehbio:~/bedtools$ bedtools jaccard \-a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed \-b fHeart-DS15839.hotspot.twopass.fdr0.05.merge.bedintersection union-intersection jaccard n_intersections81269248 160493950 0.50637 130852

小思考：1. 如何用bedtools其它工具算出这个结果？2.

如果需要比较的文件很多，怎么充分利用计算资源？

一个办法是使用for循环,

双层嵌套。这种用法也很常见，不管是单层还是双层for循环，都有利于简化重复运算。

ct@ehbio:~/bedtools$ for i in *.merge.bed; do \for j in *.merge.bed; do \bedtools jaccard -a $i -b $j | cut -f3 | tail -n +2 | sed "s/^/$i\t$j\t/"; \done; done >total.similarity

另一个办法是用parallel，不只可以批量，更可以并行。

root@ehbio:~# yum install parallel.noarch# parallel 后面双引号("")内的内容为希望用parallel执行的命令，# 整体写法与Linux下命令写法一致。# 双引号后面的三个相邻冒号 (:::)默认用来传递参数的，可多个连写。# 每个三冒号后面的参数会被循环调用，而在命令中的引用则是根据其出现的位置，分别用{1}, {2}# 表示第一个三冒号后的参数，第二个三冒号后的参数。## 这个命令可以替换原文档里面的整合和替换, 相比于原文命令生成多个文件，这里对每个输出结果# 先进行了比对信息的增加，最后结果可以输入一个文件中。#ct@ehbio:~/bedtools$ parallel "bedtools jaccard -a {1} -b {2} | awk 'NR> | cut -f 3 \| sed 's/^/{1}\t{2}\t/'" ::: `ls *.merge.bed` ::: `ls *.merge.bed` >totalSimilarity.2# 上面的命令也有个小隐患，并行计算时的输出冲突问题，可以修改为输出到单个文件,再cat到一起ct@ehbio:~/bedtools$ parallel "bedtools jaccard -a {1} -b {2} | awk 'NR> | cut -f 3 \| sed 's/^/{1}\t{2}\t/' >{1}.{2}.totalSimilarity_tmp" ::: `ls *.merge.bed` ::: `ls *.merge.bed` ct@ehbio:~/bedtools$ cat *.totalSimilarity_tmp >totalSimilarity.2# 替换掉无关信息ct@ehbio:~/bedtools$ sed -i -e 's/.hotspot.twopass.fdr0.05.merge.bed//' \-e 's/.hg19//' totalSimilarity.2

原文档的命令，稍微有些复杂，利于学习不同命令的组合。使用时推荐使用上面的命令。

ct@ehbio:~/bedtools$ parallel "bedtools jaccard -a {1} -b {2} \| awk 'NR>1' | cut -f 3 \> {1}.{2}.jaccard" \::: `ls *.merge.bed` ::: `ls *.merge.bed`

This command will create a single file containing the pairwise Jaccard

measurements from all 400 tests.

A bit of cleanup to use more intelligible names for each of the samples.

cat pairwise.dnase.txt \| sed -e 's/.hotspot.twopass.fdr0.05.merge.bed//g' \| sed -e 's/.hg19//g' \> pairwise.dnase.shortnames.txt

Now let’s make a 20x20 matrix of the Jaccard statistic. This will allow

the data to play nicely with R.

awk 'NF==3' pairwise.dnase.shortnames.txt \| awk '$1 ~ /^f/ && $2 ~ /^f/' \| python make-matrix.py \> dnase.shortnames.distance.matrix

在广大粉丝的期待下，《生信宝典》联合《宏基因组》于4月14在北京鼓楼推出《ChIP-seq分析专题培训》，为大家提供一条走进生信大门的捷径、为同行提供一个ChIP-seq实战分析学习和交流的机会、助力学员真正理解分析原理和完成实战分析，独创线下集中授课2天+自行练习5天+再集中讲解答疑2天+后期学习群的四段式教学，并提供学习视频，教、学、练、答结合，真正实现独立分析大数据。

关于学习生物信息学分析的重要性，请阅读《生物信息9天速成班—成为团队中不可或缺的人》。

ChIP-seq基本分析流程见流程。课程将在这个基础上，提供更深入地分析指导。

课程介绍

座位按报名并成功缴费顺序从前到后龙摆尾式排序。赠送价值188元线上生信基础课程一门，目前的《应用Python处理生物信息数据和作图》、《生物信息作图系列R、Cytoscape及图形排版》和《生物信息中的Linux应用》任选其一。获赠32G品牌定制U盘(内含数据资料)。多人(N，10>N>1)组团报名并同时缴费，每人还可获得价值N百元的礼品(京东购物卡)。

本内容不代表本网观点和政治立场，如有侵犯你的权益请联系我们处理。

网友评论

网友评论仅供其表达个人看法，并不表明网站立场。