NGS-based methods and Data Science

【NGS with Data Science】Gene2vec distributed representation of genes pipeline reproduction

This article explains how to use the pipeline in the paper "Gene2vec: distributed representation of genes based on co-expression." and re-train the model with your data.

Install pipeline

First, we create the path to download the package . We created a conda virtual environment of Python 3.7 as following commands:

Bash
mkdir gene2vec_test
cd gene2vec_test/
conda create -n gene2vec_test_env python=3.7
conda activate gene2vec_test_env

Then download the package with git command:

Bash
git clone https://github.com/jingcheng-du/Gene2vec.git
cd Gene2vec/

Because there are differences in parameter names between gensim 3.x and 4, if you want to use gemsim with 3.x version, remember to change requirements.txt: change gensim>=3.4.0 to gensim==3.4.0. (If you want to use gensim with 4.x version you need to modify gene2vec.py, such as changing the "size" of the word2vec object to "vector_size", etc.)

Bash
pip install -r requirements.txt

After the installation program is completed, we can test it as the command:

Bash
cd src/
python gene2vec.py 

If the following message appears, the installation should be good.

Bash
usage: gene2vec.py [-h] N [N ...]

Training Models

Then we could use the testing data to run the example :

Bash
python gene2vec.py ../data/ ../out txt

The gene vector would saved at the running path and here were some example data:

Bash
head -n 2 outgene2vec_dim_100_iter_1.txt
FGF6	0.002153057 0.001094015 0.0040485994 0.003507802 -0.0034948308 -0.0032273065 0.002758114 -0.0044576144 0.002355916 0.0017780543 0.004745546 0.0018376901 -0.0035088449 -0.0005739574 -0.000108827386 0.002103943 -0.0038852852 0.0012951874 -0.0031769034 -0.004375249 0.004074314 -0.0026881285 0.004214152 -0.004282877 0.0022233215 0.004169825 0.00061325595 2.0139367e-05 -0.0016913096 0.0025811284 0.0031880501 -0.0019990925 -0.0047910786 0.002188197 0.0026727102 -0.0006805879 0.00019095051 0.0010278132 0.0017754859 0.0031797176 -0.003708027 -0.0043337652 0.0035265626 -0.0008643125 0.00084504695 -0.00054039893 0.0003750502 -0.0037928058 -0.0042927195 -0.0047074244 -0.0017722481 0.00025958134 -0.0026379086 0.00018871028 -0.0019723917 -0.00021585514 0.0033635853 0.0022829815 -0.0024485104 0.0011425553 0.003241704 0.0047381823 0.0012685822 0.0041412427 0.0019761408 0.0019880526 0.0039201365 0.0013327249 -0.002263571 -0.0044547706 0.0037608626 0.00095062394 -0.00030630908 0.0031630904 -0.0018972668 0.004344254 0.0025073248 0.0037321039 -0.004189576 0.0025266777 0.0005846647 0.0019490473 0.0018105969 -0.004199487 0.0020253006 -0.0017606984 -0.004815944 0.0046018823 0.0042982115 0.00051282457 -0.0009345786 0.003392324 -0.0032844574 0.0011845101 -0.0011895953 -0.0012602699 -0.00042309787 0.004582391 0.0025786795 -0.0024350516 
GFI1B	-0.0029350498 0.0043180487 -0.004318311 0.0019120751 -0.0038370104 -0.00023128637 -0.004420749 -0.0035758333 -0.0040116534 0.0012707855 -0.0009630754 0.0004477923 0.0020208724 -0.00041648198 0.003939566 -0.0040858993 -0.004756729 0.0018472039 -0.0021072265 0.002428173 -0.00014559152 0.0045682737 -0.0033070655 -0.0035072211 0.00053472363 -0.0026147643 0.00052187295 0.0034156216 -0.0035089792 0.001963524 -0.0040159533 0.0029510746 0.004897053 0.0017880275 0.0009832341 -0.004501591 -0.0021778357 0.002407189 0.000616764 -0.003227798 -0.0042902012 0.0024847183 0.003374102 0.002082069 0.001478934 0.0048288074 0.0042617135 -0.0018422379 0.0039390987 0.00026498176 -0.00028904268 0.0011463418 0.0027650178 0.0037835115 0.0007013022 0.004905474 0.0006962089 0.0002940799 0.0038201583 -0.0031658853 -0.00292867 -0.00054527074 0.004884007 0.002188833 0.00015647558 -0.002252723 0.0020673836 0.0038181976 0.00041569016 -0.003276892 -0.002797324 0.0020927635 0.0010414731 -0.004298761 0.002510277 -0.0017390802 0.00439754 -0.0042876415 -0.00071369467 0.002830168 -0.0037963414 0.0036242604 0.00023945107 0.004529737 0.001412234 -0.0010020512 0.0044706156 0.0015063612 0.0029264004 -0.0043842485 -0.0016326424 0.0022118944 0.00042738195 -0.004558031 -0.003733534 -0.0029223813 0.0048098615 -0.0019367941 0.00491898 -0.0025868856 

In addition, we could use the pre-trained model and run the tSNE, for example:

Bash
pip install MulticoreTSNE
pip  install scikit-learn

python tsne_multi_core.py

and use plot.py to plot it :

Python
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
df=pd.read_csv("TSNE_data_gene2vec.txt_100.txt",sep=" ",header=None)
df.columns = ['x', 'y']
plt.scatter(x=df["x"],y=df["y"])
plt.show()

than could get the 2d projection of the genes.

Each data point was the representation of a gene. Now, we have roughly completed the conversion of gene names to vectors for decentralized representation. This method can enhance the prediction capabilities of other biological markers .

If we trace the source code of gene2vec.py, can see that the training method is similar to the general NLP word vector method. The idea was only to change the general NLP input to the gene list of the GSEA data set.

References

  • Du, J., Jia, P., Dai, Y. et al. Gene2vec: distributed representation of genes based on co-expression. BMC Genomics 20 (Suppl 1), 82 (2019). https://doi.org/10.1186/s12864-018-5370-x

Leave a Reply

Your email address will not be published. Required fields are marked *

en_USEnglish