【NGS with Data Science】Use bioinfokit to make a volcano plot

In this article, I will explain how to use Python to draw a Volcano plot. The Volcano plot was a standard tool for illustrating the differential analysis of gene expression analysis during the downstream of quantitative analysis, which was a special type of scatter plot. The x-axis of the Volcano plot was the log2 Fold change of two conditions, and the y-axis was the negative value of the negative logarithm base 10 of the p-value. Each data point represents a gene or other biological feature being tested. In RNA-Seq, each data point would be the genes.
這裡介紹使用Python的套件bioinfokit來繪製火山圖,首先我們需要下載範例資料以及安裝bioinfokit :
pip install bioinfokit
wget https://zenodo.org/record/2529117/files/limma-voom_luminalpregnant-luminallactate
wget https://zenodo.org/record/2529117/files/volcano_genes
稍微看一下資料的情況
head limma-voom_luminalpregnant-luminallactate
輸出:
ENTREZID SYMBOL GENENAME logFC AveExpr t P.Value adj.P.Val
12992 Csn1s2b casein alpha s2-like B -8.603611114762 3.56295004142591 -43.7964980711435 3.83064977005569e-15 6.05395889659601e-11
13358 Slc25a1 solute carrier family 25 (mitochondrial carrier, citrate transporter), member 1 -4.12417532129173 5.77969894403042 -29.9078492752674 1.75859473379618e-13 1.38964155864574e-09
11941 Atp2b2 ATPase, Ca++ transporting, plasma membrane 2 -7.38698638678659 1.28214314739647 -27.8194991746381 4.83636254056037e-13 2.43279979019347e-09
20531 Slc34a2 solute carrier family 34 (sodium phosphate), member 2 -4.17781242057656 4.27862903538987 -27.0727230566646 6.15742796809282e-13 2.43279979019347e-09
100705 Acacb acetyl-Coenzyme A carboxylase beta -4.3143199499725 4.4409137064501 -25.2235657685746 1.4999774593805e-12 4.74112875360987e-09
13645 Egf epidermal growth factor -5.36266382024579 0.73590465313728 -24.5993025952199 2.11624444834827e-12 5.57418787694935e-09
230810 Slc30a2 solute carrier family 30 (zinc transporter), member 2 -3.20311813582619 2.69581147606106 -23.80427759327 3.02466813499713e-12 6.82883645792781e-09
68801 Elovl5 ELOVL family member 5, elongation of long chain fatty acids (yeast) -2.86330403668687 6.45520453127796 -22.3535751591657 6.5987442593808e-12 1.30358192844068e-08
19659 Rbp1 retinol binding protein 1, cellular 5.44304402150002 6.10703318183881 21.0523576693295 1.47914323855668e-11 2.36474559807046e-08
limma-voom_luminalpregnant-luminallactate是一個tsv格式的下游分析結果。縱向是基因,橫向則是每個基因的變化量(這個案例是logFC),以及P值和校正後P值(adj.P.Val)
再看一下volcano_genes
head volcano_genes
GeneID
Mcl1
Hbegf
Tgfb2
Cxcl16
Csf1
Pdgfb
Edn1
Lif
Kitl
它是等一下要標註上名稱的候選基因名稱的列表
接著我們使用pandas把tsv讀成dataframe,並傳入visuz.GeneExpression.volcano這個函式就可以了
from bioinfokit import analys, visuz
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("limma-voom_luminalpregnant-luminallactat", sep='\t')
with open('volcano_genes') as f:
genes = f.read().splitlines()
df.dropna(inplace=True)
visuz.GeneExpression.volcano(df=df, lfc='logFC', pv='adj.P.Val', lfc_thr=(1, 1), sign_line=True, plotlegend=True, legendpos='upper right', legendanchor=(
1.46, 1), dotsize=0.1, gfont=7, xlm=(-10.0, 10.0, 1), pv_thr=(0.05, 0.05), genenames=tuple(genes), geneid="SYMBOL")
執行之後會在程式得得目錄下輸出圖檔

以上就是繪製火山圖的介紹。
A passionate bioinformatician focuses on the next generation of medical science and biotechnology.