KAICD / Sep 18 2020

div.ProseMirror

VAE 실습

설치

# 사용할 코드 베이스 깃에서 다운!git clone https://github.com/aspuru-guzik-group/chemical_vae.git

8.5s

Python

import tensorflow as tftf.__version__

3.8s

Python

'1.14.0'

# rdkit 동작 확인from rdkit import Chemfrom rdkit.Chem.Draw import IPythonConsoleChem.MolFromSmiles('c1ccccc1')

0.4s

Python

!pip install keras==2.0.6

8.7s

Python

Load libraries

# tensorflow backendfrom os import environenviron['KERAS_BACKEND'] = 'tensorflow'# vae stufffrom chemical_vae.chemvae.vae_utils import VAEUtilsfrom chemical_vae.chemvae import mol_utils as mu# import scientific pyimport numpy as npimport pandas as pd# rdkit stufffrom rdkit.Chem import AllChem as Chemfrom rdkit.Chem import PandasTools# plotting stuffimport matplotlib.pyplot as pltimport matplotlib as mplfrom IPython.display import SVG, display# 군집도 확인 및 시각화를 위한 차원축소 모듈from sklearn.decomposition import PCAfrom sklearn.preprocessing import MinMaxScalerfrom sklearn.manifold import TSNE%config InlineBackend.figure_format = 'retina'%matplotlib inline

2.6s

Python

Load a model

zinc 데이터를 활용하여 사전 학습된 모델을 불러온다

# ZINC데이터 보기testdf = pd.read_csv('./chemical_vae/models/zinc_properties/250k_rndm_zinc_drugs_clean_3.csv')testdf

0.4s

Python

log P : 물-옥탄올 분배 계수(water-octanal partition coefficient)
- 농도 비율에 log를 취한 값을 보통 사용
- 이온화 되지 않은 화합물의 농도 비율을 logP
- 유기 용매의 극성이 높을 수록 P 값이 낮아지며,
  따라서 독성이 더 큼을 의미
- 단, 해당 코드에서는 두 객체 사이의 유사성을 판별하기 위하여 사용
- 2일차에 설명된 내용과 중복
QED : 약물 유사성의 정량적 추정(Quantitative Estimation of Drug-likeness)
- 논문 "Quantifying the chemical beauty of drugs"에 기반
- 0~1 사이의 값
- 분자량, logP, 토폴로지 극성 표면적, 수소 결합 공여체 및 수용체 수, 방향족 고리 및 회전 가능한 결합의 수, 원치 않는 화학적 기능의 존재를 포함한 분자 특성의 기본 분포를 반영
  - molecular weight, logP, topological polar surface area, number of hydrogen bond donors and acceptors, the number of aromatic rings and rotatable bonds, and the presence of unwanted chemical functionalities
- 화합물의 품질을 정량화하기 위해 선호도 개념을 적용
- 표적 기능과 관련된 장점에 따라 화학 구조의 순위를 매기는 수단
- 참조: https://www.rdkit.org/docs/source/rdkit.Chem.QED.html
SAS : 합성(종합) 접근성 점수(synthetic accessibility score)
- 논문 "Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions"와 "Prediction of Physicochemical Parameters by Atomic Contributions"에 기반
- SAscore = fragmentscore - complexitypenalty
- fragmentscore (fragment의 유사도를 측정)
- complexitypenalty (ringComplexityScore, stereoComplexityScore, macrocyclePenalty, sizePenalty 의 조합)
- PubChem 데이터베이스에서 934,046 개의 대표 분자를 fragment화 하여 계산
- 1-10 사이의 값으로 계산 가능
- 참조 : https://depth-first.com/articles/2010/10/28/predicting-synthetic-accessibility/

# 학습된 모델 불러오기vae = VAEUtils(directory='/chemical_vae/models/zinc_properties')

30.8s

Python

VAEUtil 모듈 확인

VAEUtil 클래스는 학습된 모델을 불러와 기타 계산을 처리해 주는 모듈로서 역할을 하는 중요한 부분

해당 부분 및 학습에 사용된 코드 전체 확인이 필요한 경우:

https://github.com/aspuru-guzik-group/chemical_vae/에서 코드 확인 가능

아래의 코드는 동작X

## vae_util 클래스 내의 함수 구성  def estimate_estandarization(self):    ## ppt에서 설명했던 가우시안 분포의 뮤(평균)값과, 시그마(표준편차)값을 도출하는 부분        print('Standarization: estimating mu and std values ...', end='')        # sample Z space        smiles = self.random_molecules(size=50000)        batch = 2500        Z = np.zeros((len(smiles), self.params['hidden_dim']))        for chunk in self.chunks(list(range(len(smiles))), batch):            sub_smiles = [smiles[i] for i in chunk]            one_hot = self.smiles_to_hot(sub_smiles)            Z[chunk, :] = self.encode(one_hot, False)        self.mu = np.mean(Z, axis=0)        self.std = np.std(Z, axis=0)        self.Z = self.standardize_z(Z)        print('done!')        return    ## 정규화와 비정규화(원상복귀)를 담당하는 부분    def standardize_z(self, z):        return (z - self.mu) / self.std    def unstandardize_z(self, z):        return (z * self.std) + self.mu    ## 랜덤한 변화점의 생성을 위한 노이즈 추가부분    def perturb_z(self, z, noise_norm, constant_norm=False):        if noise_norm > 0.0:            noise_vec = np.random.normal(0, 1, size=z.shape)            noise_vec = noise_vec / np.linalg.norm(noise_vec)            if constant_norm:                return z + (noise_norm * noise_vec)            else:                noise_amp = np.random.uniform(                    0, noise_norm, size=(z.shape[0], 1))                return z + (noise_amp * noise_vec)        else:            return z    # Now reports predictions after un-normalization.    def predict_prop_Z(self, z, standardized=True):        if standardized:            z = self.unstandardize_z(z)        # both regression and logistic        if (('reg_prop_tasks' in self.params) and (len(self.params['reg_prop_tasks']) > 0) and                ('logit_prop_tasks' in self.params) and (len(self.params['logit_prop_tasks']) > 0)):            reg_pred, logit_pred = self.property_predictor.predict(z)            if 'data_normalization_out' in self.params:                df_norm = pd.read_csv(self.params['data_normalization_out'])                reg_pred = reg_pred * \                    df_norm['std'].values + df_norm['mean'].values            return reg_pred, logit_pred        # regression only scenario        elif ('reg_prop_tasks' in self.params) and (len(self.params['reg_prop_tasks']) > 0):            reg_pred = self.property_predictor.predict(z)            if 'data_normalization_out' in self.params:                df_norm = pd.read_csv(self.params['data_normalization_out'])                reg_pred = reg_pred * \                    df_norm['std'].values + df_norm['mean'].values            return reg_pred        # logit only scenario        else:            logit_pred = self.property_predictor.predict(self.encode(z))            return logit_pred    # wrapper functions    def predict_property_function(self):        # Now reports predictions after un-normalization.        def predict_prop(X):            # both regression and logistic            if (('reg_prop_tasks' in self.params) and (len(self.params['reg_prop_tasks']) > 0) and                    ('logit_prop_tasks' in self.params) and (len(self.params['logit_prop_tasks']) > 0)):                reg_pred, logit_pred = self.property_predictor.predict(                    self.encode(X))                if 'data_normalization_out' in self.params:                    df_norm = pd.read_csv(                        self.params['data_normalization_out'])                    reg_pred = reg_pred * \                        df_norm['std'].values + df_norm['mean'].values                return reg_pred, logit_pred            # regression only scenario            elif ('reg_prop_tasks' in self.params) and (len(self.params['reg_prop_tasks']) > 0):                reg_pred = self.property_predictor.predict(self.encode(X))                if 'data_normalization_out' in self.params:                    df_norm = pd.read_csv(                        self.params['data_normalization_out'])                    reg_pred = reg_pred * \                        df_norm['std'].values + df_norm['mean'].values                return reg_pred            # logit only scenario            else:                logit_pred = self.property_predictor.predict(self.encode(X))                return logit_pred        return predict_prop

0.0s

Python

Using the VAE

Decode/Encode

기존 Deep Network에서는 learning rate를 너무 높게 잡을 경우 gradient가 explode/vanish 하거나, 나쁜 local minima에 빠지는 문제가 있었다. 이는 parameter들의 scale 때문인데, Batch Normalization을 사용할 경우 propagation 할 때 parameter의 scale에 영향을 받지 않게 된다. 따라서, learning rate를 크게 잡을 수 있게 되고 이는 빠른 학습을 가능케 한다.
Batch Normalization의 경우 자체적인 regularization 효과가 있다. 이는 기존에 사용하던 weight regularization term 등을 제외할 수 있게 하며, 나아가 Dropout을 제외할 수 있게 한다 (Dropout의 효과와 Batch Normalization의 효과가 같기 때문.) . Dropout의 경우 효과는 좋지만 학습 속도가 다소 느려진다는 단점이 있는데, 이를 제거함으로서 학습 속도도 향상된다.

vae.enc.summary()

0.5s

Python

#!apt install python-pydot python-pydot-ng graphviz#!pip install pip --upgrade

19.1s

Python

#!pip install pydot lightgbm graphviz

6.1s

Python

#import keras#import pydot as pyd#from IPython.display import SVG#from keras.utils.vis_utils import model_to_dot#keras.utils.vis_utils.pydot = pyd#from graphviz import Digraphdef visualize_model(model):  return SVG(model_to_dot(model, show_shapes="True", show_layer_names="True").create(prog='dot', format='svg'))#visualize_model(vae.enc)

0.7s

Python

vae.dec.summary()

0.4s

Python

#visualize_model(vae.dec)

0.7s

Python

vae.property_predictor.summary()

0.3s

Python

#visualize_model(vae.property_predictor)

0.3s

Python

# HyperParameters.py에서 설정한 파라미터 보기vae.params

0.0s

Python

smiles : 아래의 값을 바꾸어 입력하여 비교 가능

CSCC(=O)NNC(=O)c1c(C)oc(C)c1C
O=C(CC1CCCCC1)NNC(=O)Cn1ccnc1
CNC(=O)c1ccc(OC(=O)c2cc(-c3ccco3)nc3ccccc23)cc1

smiles_1 = mu.canon_smiles('CSCC(=O)NNC(=O)c1c(C)oc(C)c1C')X_1 = vae.smiles_to_hot(smiles_1,canonize_smiles=True)z_1 = vae.encode(X_1)X_r= vae.decode(z_1)print('{:20s} : {}'.format('Input',smiles_1))print('{:20s} : {}'.format('Reconstruction',vae.hot_to_smiles(X_r,strip=True)[0]))print('{:20s} : {} with norm {:.3f}'.format('Z representation',z_1.shape, np.linalg.norm(z_1)))

0.9s

Python

property predictor

print('Properties (qed,SAS,logP):')y_1 = vae.predict_prop_Z(z_1)[0]print(y_1)

0.6s

Python

Decode several attempts

VAE are probabilistic

noise=5.0print('Searching molecules randomly sampled from {:.2f} std (z-distance) from the point'.format(noise))df = vae.z_to_smiles(z_1, decode_attempts=100, noise_norm=noise)print('Found {:d} unique mols, out of {:d}'.format(len(set(df['smiles'])),sum(df['count'])))print('SMILES\n',df.smiles)display(PandasTools.FrameToGridImage(df,column='mol', legendsCol='smiles',molsPerRow=5))df.head(10)#CSCC(=O)NNC(=O)c1c(C)oc(C)c1C

3.5s

Python

sampling

Sample random points from the training set along with properties

# 5만개의 데이터 랜덤 샘플링# encoding된 latentvector, property, smiles 쌍으로 생성Z, data, smiles = vae.ls_sampler_w_prop(size=50000,return_smiles=True)prop_opt = 'qed'prop_df = pd.DataFrame(data).reset_index()prop_df['smiles']=smiles

34.6s

Python

prop_df

0.8s

Python

차원 축소를 통한 데이터 시각화

군집도를 파악하기 위하여, 고차원 데이터셋(Latent Vector Space)을 육안으로 확인 가능한 2차원으로 축소

PCA of latent space

Perform a PCA projection and color the points based on a property

#!pip install scipy#!pip install scikit-learn

0.0s

Python

import platform; print(platform.platform())import sys; print("Python", sys.version)import numpy; print("NumPy", numpy.__version__)import scipy; print("SciPy", scipy.__version__)import sklearn; print("Scikit-Learn", sklearn.__version__)

0.4s

Python

# do pca and normalizeprop_opt = 'qed' # qed logP SASZ_pca = PCA(n_components=2).fit_transform(Z)Z_pca = MinMaxScaler().fit_transform(Z_pca)df = pd.DataFrame(np.transpose((Z_pca[:,0],Z_pca[:,1])))df.columns = ['x','y']df[prop_opt]=prop_df[prop_opt]plt.scatter(x=df['x'], y=df['y'], c=df[prop_opt],            cmap= 'viridis', marker='.',            s=10,alpha=0.5, edgecolors='none')# viridis는 색맹인 사람도 확인이 가능한 색의 분포로 color를 구성plt.show()

1.2s

Python

유사한 데이터 추출 후 기존 분포에서 확인 in PCA

제약된 수치는 큰 의미 없이 임의로 정한 값

약 10개 이하의 적은 개수의 데이터만 따로 보기 위해 정한 수치

# logP Property # 1.11 < x < 1.111 사이 값 추출tmp1 = prop_df[prop_df['logP']<1.111]tmp1 = tmp1[tmp1['logP']>1.11]tmp1_index = tmp1.indextmp1

Shift+Enter to run

Python

# QED Property # 0.571000 < x < 4.571100 사이 값 추출tmp2 = prop_df[prop_df['qed']<0.571100]tmp2 = tmp2[tmp2['qed']>0.571000]tmp2_index = tmp2.indextmp2

Shift+Enter to run

Python

# SAS Property # 2.1105 < x < 2.1110 사이 값 추출tmp3 = prop_df[prop_df['SAS']<2.111]tmp3 = tmp3[tmp3['SAS']>2.1105]tmp3_index = tmp3.indextmp3

Shift+Enter to run

Python

# do pca and normalize + Z_pca = PCA(n_components=2).fit_transform(Z)Z_pca = MinMaxScaler().fit_transform(Z_pca)prop_opt = 'qed' # qed logP SASdf = pd.DataFrame(np.transpose((Z_pca[:,0],Z_pca[:,1])))df.columns = ['x','y']df[prop_opt]=prop_df[prop_opt]plt.scatter(x=df['x'], y=df['y'], c=df[prop_opt],            cmap= 'viridis', marker='.',            s=10,alpha=0.5, edgecolors='none')# 붉은색으로 few dataset 분포 추가plt.scatter(x=df['x'][tmp1_index], y=df['y'][tmp1_index],             color= 'red', marker='.',            s=10,alpha=1, edgecolors='none')plt.show()

Shift+Enter to run

Python

T-SNE of Latent space

Z_tsne = TSNE(n_components=2).fit_transform(Z)Z_tsne = MinMaxScaler().fit_transform(Z_tsne)                                   

1607.1s

Python

new_df = pd.DataFrame(np.transpose((Z_tsne[:,0],Z_tsne[:,1])))new_df.columns = ['x','y']new_df[prop_opt]=prop_df[prop_opt]plt.scatter(x=new_df['x'], y=new_df['y'], c=new_df[prop_opt],            cmap= 'viridis', marker='.',            s=10,alpha=0.5, edgecolors='none')plt.show()

1.0s

Python

유사 데이터 추출 후 기존 분포에서 확인 in T-SNE

new_df = pd.DataFrame(np.transpose((Z_tsne[:,0],Z_tsne[:,1])))new_df.columns = ['x','y']prop_opt = 'qed' # qed logP SASnew_df[prop_opt]=prop_df[prop_opt]plt.scatter(x=new_df['x'], y=new_df['y'], c=new_df[prop_opt],            cmap= 'viridis', marker='.',            s=10,alpha=0.5, edgecolors='none')# 붉은색으로 few dataset 분포 추가plt.scatter(x=new_df['x'][tmp1_index], y=new_df['y'][tmp1_index], c=new_df[prop_opt][tmp1_index],            cmap= 'viridis', marker='.',            s=10,alpha=0.5, edgecolors='none')plt.show()

Shift+Enter to run

Python

VAE 실습

설치

Load libraries

Load a model

VAEUtil 모듈 확인

Using the VAE

Decode/Encode

property predictor

Decode several attempts

sampling

차원 축소를 통한 데이터 시각화

PCA of latent space

유사한 데이터 추출 후 기존 분포에서 확인 in PCA

T-SNE of Latent space

유사 데이터 추출 후 기존 분포에서 확인 in T-SNE

Runtimes (1)