【NGS Next-Generation Genomic Data Science】 Introduction to t-SNE

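What is t-SNE
t-SNE (t-distributed stochastic neighbor embedding) is a nonlinear dimensionality reduction technique. It places data points that are similar in the high-dimensional space close together in the low-dimensional space, so the data is mapped from high to low dimensions while the local structure of its distribution is preserved.
t-SNE uses probability distributions to describe how likely each data point is to be a neighbor of every other point. It performs the reduction by minimizing the gap between the distribution over the high-dimensional data and the distribution over the low-dimensional data. That gap is measured with the KL divergence, and the optimization algorithm is gradient descent.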
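As a rough illustration of the objective described above, the following sketch computes neighbor probabilities in both spaces and the KL divergence between them. It is a simplification and not scikit-learn's implementation: it assumes one fixed Gaussian bandwidth `sigma` for every point (real t-SNE tunes one per point from the perplexity), it skips the gradient descent step, and the function names are made up for this example.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distance between every pair of rows in Z."""
    sq = np.sum(Z ** 2, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.maximum(d, 0.0)  # clip tiny negatives caused by rounding

def high_dim_affinities(X, sigma=1.0):
    """Gaussian neighbor probabilities P (assumption: one fixed sigma)."""
    w = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)  # a point is not its own neighbor
    return w / w.sum()

def low_dim_affinities(Y):
    """Student-t (one degree of freedom) neighbor probabilities Q."""
    w = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(w, 0.0)
    return w / w.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the gap that t-SNE minimizes over the positions Y."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)
# Two well-separated groups of 5 points each in 10 dimensions.
X = np.vstack([rng.normal(0.0, 0.1, (5, 10)), rng.normal(3.0, 0.1, (5, 10))])
P = high_dim_affinities(X)

# A 2-D layout that keeps the two groups apart ...
Y_good = np.vstack([rng.normal(0.0, 0.1, (5, 2)), rng.normal(5.0, 0.1, (5, 2))])
# ... and one that interleaves them, so original neighbors end up far apart.
Y_bad = Y_good[[0, 5, 1, 6, 2, 7, 3, 8, 4, 9]]

kl_good = kl_divergence(P, low_dim_affinities(Y_good))
kl_bad = kl_divergence(P, low_dim_affinities(Y_bad))
print(kl_good, kl_bad)
```

The layout that keeps neighbors together gives the smaller divergence, which is the direction gradient descent pushes the low-dimensional points during optimization.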
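Points that are dissimilar in the high-dimensional space tend to be mapped farther apart in the low-dimensional space than they originally were, while similar points end up closer together than before. For data whose grouping is already known, this makes t-SNE useful for data QC: the visualization is clearer, and it largely solves the problem of data points piling on top of each other after dimensionality reduction.
For the same reason, distances in the t-SNE output should not be used to compute relationships between clusters. In general, though, it is fine to use the t-SNE output to judge whether a local group of data points belongs to a particular cluster.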
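The point that local neighborhoods are what survives the mapping can be checked numerically. The sketch below uses scikit-learn's `trustworthiness` score, which measures how well each point's nearest neighbors are retained in the embedding (values close to 1 mean they are well retained). To keep the run short it uses 300 samples from scikit-learn's built-in digits dataset as a stand-in; the sample size, `n_neighbors=10`, and `random_state=0` are arbitrary choices for this illustration.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, trustworthiness

# Small stand-in dataset: 8x8 digit images flattened to 64 features.
digits = load_digits()
X_small = digits.data[:300]

# Embed into 2 dimensions; random_state only makes the run repeatable.
embedding = TSNE(n_components=2, random_state=0).fit_transform(X_small)

# Fraction-like score in [0, 1]: are each point's 10 nearest neighbors
# in the embedding also close neighbors in the original space?
score = trustworthiness(X_small, embedding, n_neighbors=10)
print(f"trustworthiness: {score:.3f}")
```

A high score supports reading local neighborhoods off the plot. It says nothing about whether the gaps between clusters are meaningful.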
Implementing t-SNE
Here we use Python with the MNIST dataset as a demonstration. First, download the demo data (https://pjreddie.com/projects/mnist-in-csv/#google_vignette):
ShellScript
wget https://pjreddie.com/media/files/mnist_train.csv
Take a quick look at the data format:
ShellScript
head -n 1 mnist_train.csv
The output is shown below. In each CSV row, the first value is the label and the remaining values are the brightness of each pixel (0-255):
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,18,18,18,126,136,175,26,166,255,247,127,0,0,0,0,0,0,0,0,0,0,0,0,30,36,94,154,170,253,253,253,253,253,225,172,253,242,195,64,0,0,0,0,0,0,0,0,0,0,0,49,238,253,253,253,253,253,253,253,253,251,93,82,82,56,39,0,0,0,0,0,0,0,0,0,0,0,0,18,219,253,253,253,253,253,198,182,247,241,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,80,156,107,253,253,205,11,0,43,154,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,14,1,154,253,90,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,139,253,190,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,11,190,253,70,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,35,241,225,160,108,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,81,240,253,253,119,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,45,186,253,253,150,27,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,16,93,252,253,187,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,249,253,249,64,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,46,130,183,253,253,207,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,39,148,229,253,253,253,250,182,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,24,114,221,253,253,253,253,201,78,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,23,66,213,253,253,253,253,198,81,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,18,171,219,253,253,253,253,195,80,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,55,172,226,253,253,253,253,244,133,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,136,253,253,253,212,135,132,16,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
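To make the row layout concrete, here is a minimal sketch that splits one row into its label and an image grid. It assumes the standard MNIST layout of 28 x 28 pixels stored row by row, which the text above does not state explicitly; `parse_mnist_row` and `ascii_preview` are helper names invented for this example, and the demo row is synthetic rather than read from the file.

```python
def parse_mnist_row(line, width=28, height=28):
    """Split one CSV row into (label, image), where image is a list of rows."""
    values = [int(v) for v in line.strip().split(',')]
    label, pixels = values[0], values[1:]
    if len(pixels) != width * height:
        raise ValueError(f"expected {width * height} pixels, got {len(pixels)}")
    # Assumption: pixels are stored row-major, one image row after another.
    image = [pixels[r * width:(r + 1) * width] for r in range(height)]
    return label, image

def ascii_preview(image, threshold=128):
    """Draw bright pixels as '#' and dark pixels as '.' for a quick look."""
    return '\n'.join(
        ''.join('#' if p >= threshold else '.' for p in row) for row in image
    )

# Synthetic demo row: label 1 with a vertical bar in column 14.
pixels = [255 if c == 14 else 0 for r in range(28) for c in range(28)]
row = ','.join(str(v) for v in [1] + pixels)

label, image = parse_mnist_row(row)
print(label)
print(ascii_preview(image))
```

The same function can be applied to the first line of `mnist_train.csv` to see the handwritten digit that the row encodes.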
Run it with scikit-learn:
Python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.manifold import TSNE

# Each row: the label in column 0, followed by the pixel brightness values (0-255).
data = pd.read_csv('mnist_train.csv', header=None)
label_col = 0

# One fixed color per digit so the clusters are easy to tell apart.
label_to_color = {
    0: 'red',
    1: 'darksalmon',
    2: 'orange',
    3: 'mediumseagreen',
    4: 'darkviolet',
    5: 'forestgreen',
    6: 'maroon',
    7: 'blue',
    8: 'dimgrey',
    9: 'lightblue',
}
data_labels = data.loc[:, label_col]
data_colors = [label_to_color[label] for label in data_labels]

# Every column except the label column is a feature.
X = data.loc[:, data.columns != label_col]

# TSNE() with default settings embeds into 2 dimensions.
model = TSNE()
X_reduction = model.fit_transform(X.values)

X_out = pd.DataFrame(X_reduction)
X_out['data_colors'] = data_colors
X_out['data_labels'] = data_labels
print(X_out)

# Draw one scatter series per digit so each digit gets its own legend entry.
X_groups = X_out.groupby('data_labels')
for gname, gdf in X_groups:
    print(gdf)
    plt.scatter(gdf.loc[:, 0], gdf.loc[:, 1], color=label_to_color[gname], label=str(gname), s=1.3)
plt.legend()
plt.show()
Here each data point is assigned the color that corresponds to its label, and the plot shows the result after running t-SNE:

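The ten digits 0-9 separate into roughly ten clusters, although each large colored region still contains a few scattered points of other colors.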
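One way to put a number on "a few scattered points of other colors" is to check how often a point's nearest neighbors in the 2-D map share its label. The sketch below is one possible measure, not something the original analysis computes; `neighbor_label_agreement` is a name invented here, and the demo runs on synthetic 2-D blobs instead of a real t-SNE result so that it finishes instantly.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighbor_label_agreement(embedding, labels, n_neighbors=10):
    """Mean fraction of each point's nearest neighbors that share its label."""
    labels = np.asarray(labels)
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(embedding)
    _, idx = nn.kneighbors(embedding)
    # Column 0 is the query point itself (assuming no duplicate coordinates).
    neighbor_labels = labels[idx[:, 1:]]
    return float(np.mean(neighbor_labels == labels[:, None]))

rng = np.random.default_rng(0)
# Synthetic stand-in for an embedding: two distant 2-D blobs of 100 points.
emb = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(20.0, 1.0, (100, 2))])
labels = np.array([0] * 100 + [1] * 100)

clean = neighbor_label_agreement(emb, labels)

# Flip the first 10 labels to mimic points sitting in the "wrong" color region.
noisy_labels = labels.copy()
noisy_labels[:10] = 1
noisy = neighbor_label_agreement(emb, noisy_labels)
print(clean, noisy)
```

With the script above, the same function could be called as `neighbor_label_agreement(X_reduction, data_labels)` once t-SNE has finished.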