Welcome to GPRS documentation!

About gprs package

This package aims to generate the PRS model from GWAS summary statistics. It is designed to deal with the data format based on the GWAS catalog and GeneATLAS database GWAS summary statistics data.

  • Understanding the workflow of this package:

  1. Filter GWAS summary statistics files (remove duplicate SNPID and select significant SNPs by P-value)

  2. Generate bfiles by Plink1.9

  3. Do clumping by Plink1.9

  4. Generate PRS model by Plink2.0

  5. Calculate statistic value of PRS model

Environment setup

  1. Setup virtualenv

$ python3 -m venv venv
  1. Activate virtualenv

$ source ./venv/bin/activate
  1. Install this package

$ pip install -e .
  • Twelve commands in gprs:

  1. geneatlas-filter-data

  2. gwas-filter-data

  3. generate-plink-bfiles

  4. clump

  5. select-clump-snps

  6. build-prs

  7. combine-prs

  8. prs-statistics

  9. combine-prs-stat

  10. transfer_atcg (optional)

  11. sub-setpop (optional)

  12. generate_plink_bfiles_w_individual_info (optional)

  13. random_draw_samples_from_fam (optional)

  14. subset_vcf_w_random_sample (optional)

Get started: understanding the data format

GWAS template

Chr Pos RSID Allele1 Allele2 Freq1 Effect StdErr P-value n_total_sum
1 10539 rs537182016 a c 0.0013 -5.1213 20.0173 0.7981 7043
1 11008 rs575272151 c g 0.9215 0.1766 0.2610 0.4985 7042.99
1 11012 rs544419019 c g 0.9215 0.1766 0.2610 0.4985 7042.99
1 14674 rs561913721 a g 0.0032 0.8040 0.8364 0.3364 7043

GeneAtlas template

SNP ALLELE NBETA-selfReported_n_1526 NSE-selfReported_n_1526 PV-selfReported_n_1526
rs1110052 T -0.00032386 0.000269 0.2286
rs112164716 T 6.5121e-05 0.001126 0.95388
rs11240779 A -0.00022416 0.00028937 0.43855
rs11260596 C 0.00025168 0.00024248 0.29931

How to choose model template?

- chr info in the file name chr info not in the file name
chr info in the header gene_atlas_model gwas_model
chr info absent in the header gene_atlas_model gene_atlas_model

Before step1:

Please unzip your .gz file first.

Step1: Unify the data format

After knowing the data format, users can choose the model (gwas or geneatlas) to unify the data format and filter out SNPs(optional). :heavy_exclamation_mark: SNPs are extract out by RSID not chromosome position

Function: gprs geneatlas-filter-data

Filter GeneAtlas csv file by P-value and unify the data format as following order: SNPID, ALLELE, BETA, StdErr, Pvalue

How to use it?

Shell:

$ gprs geneatlas-filter-data --ref [str] --data_dir [str] --result_dir [str] --snp_id_header [str] --allele_header [str] --beta_header [str] --se_header [str] --pvalue_header [str] --pvalue [float/scientific notation] --output_name [str]  
$ gprs gwas-filter-data --ref [str] --data_dir [str] --result_dir [str] --snp_id_header [str] --allele_header  [str] --beta_header [str] --se_header [str] --pvalue_header [str] --pvalue [float/scientific notation] --output_name [str]  

Python:

from gprs.gene_atlas_model import GeneAtlasModel
if __name__ == '__main__':
    geneatlas = GeneAtlasModel( ref='1000genomes/hg19',
                    data_dir='data/2014_GWAS_Height' )

    geneatlas.filter_data( snp_id_header='MarkerName',
                            allele_header='Allele1',
                            beta_header='b',
                            se_header ='SE',
                            pvalue_header='p',
                            output_name='2014height')
   
from gprs.gwas_model import GwasModel
if __name__ == '__main__':
    gwas = GwasModel( ref='/home1/ylo40816/1000genomes/hg19',
                 data_dir='/home1/ylo40816/Projects/GPRS/data/2019_GCST008970')

    gwas.filter_data( snp_id_header='RSID',
                   allele_header='Allele1',
                   beta_header='Effect',
                   se_header='StdErr',
                   pvalue_header='P-value',
                   output_name='GCST008970',
                   file_name='gout_chr1_22_LQ_IQ06_mac10_all_201_rsid.csv')

output files

  • *.QC.csv (QC files )

  • *.csv (snplist)

Setp 3-1: Clumping (remove linked SNPs)

Function: gprs clump

This option encodes plink1.9 clump function

plink --bfile [bfiles] --clump [qc snpslists] --clump-p1  --clump-p2  --clump-r2  --clump-kb  --clump-field  --clump-snp-field  --out 

The plink_bfiles_dir, qc snpslists and clump_output_dir will automatically be filled in the script. Users have to indicate the options below.

How to use it?

Shell:

$ gprs clump --ref [str] --data_dir [str] --clump_kb [int] --clump_p1 [float/scientific notation] --clump_p2 [float/scientific notation] --clump_r2 [float] --clump_field [str] --clump_snp_field [str] --plink_bfile_name [str] --qc_file_name [str] --output_name [output name]

Python:

from gprs.gene_atlas_model import GeneAtlasModel

if __name__ == '__main__':
    geneatlas = GeneAtlasModel( ref='1000genomes/hg19',
                    data_dir='data/2014_GWAS_Height' )

    geneatlas.clump(output_name='2014height',
                    clump_kb='250',
                    clump_p1='0.02', clump_p2='0.02',
                    qc_file_name='2014height',
                    plink_bfile_name='2014height')


output files

  • *.clump

CHR F SNP BP P TOTAL NSIG S05 S01 S001 S0001 SP2
1 1 rs1967017 145723645 3.72e-16 19 0 2 6 3 8 rs11590105(1),rs17352281(1),rs9728345(1),rs11587821(1)
1 1 rs760077 155178782 7.45e-10 12 0 2 2 1 7 rs11589479(1),rs3766918(1),rs4625273(1),rs4745(1),rs12904(1)

Setp 3-2: Filter SNPs depends on .clump

After clumping, we have to filter SNPs again, to remove linked SNPs. In this step, we will have new SNPs list, and use it for generate PRS model.

Function: gprs select-clump-snps

How to use it?

Shell:

$ gprs select-clump-snps --result_dir [str] --qc_file_name [str] --clumpfolder_name [str] --clump_file_name [str] --clump_kb [int] --clump_p1 [float/scientific notation] --clump_r2 [float] --output_name [output name]

Python:

from gprs.gene_atlas_model import GeneAtlasModel

if __name__ == '__main__':
    geneatlas = GeneAtlasModel( ref='1000genomes/hg19',
                    data_dir='data/2014_GWAS_Height' )

    geneatlas.select_clump_snps(output_name='2014height',clump_file_name='2014height',
                           qc_file_name='2014height',clumpfolder_name='',clump_kb='250',
                    clump_p1='0.02', clump_r2='0.02')

output files

  • *.qc_clump_snpslist.csv

With Chromosome information:

CHR SNP Allele Beta SE Pvalue
1 rs1967017 A -0.0736 0.0315 0.01938
1 rs760077 T -0.1603 0.0543 0.003139

Without Chromosome information:

SNPID Allele Beta SE Pvalue
9:98316375:A:G A -0.0736 0.0315 0.01938
9:105570921:T:C T -0.1603 0.0543 0.003139

Setp 4-1: Generate PRS model

Generate PRS model by using Dosage by plink2.0

Function: gprs build-prs

plink2 --vcf [vcf input] dosage=DS --score [snplists afte clumped and qc]  --out 

The clumped qc snpslists and prs_output_dir will automatically be filled in the script. Users have to indicate the options below.

How to use it?

Shell:

$ gprs build-prs --vcf_input [str] --qc_clump_snplist_foldername [str] --symbol [str/int] --columns [int] --plink_modifier [str] --memory [int] --clump_kb [int] --clump_p1 [float/scientific notation] --clump_r2 [float] --output_name [output name]

Python:

from gprs.gene_atlas_model import GeneAtlasModel

if __name__ == '__main__':
    geneatlas = GeneAtlasModel( ref='1000genomes/hg19',
                    data_dir='data/2014_GWAS_Height' )

    geneatlas.build_prs( vcf_input= '1000genomes/hg19',
                          output_name ='2014height', memory='1000',clump_kb='250',
                    clump_p1='0.02', clump_r2='0.02', qc_clump_snplist_foldername='2014height')

output files

  • *.sscore

IID ALLELE_CT NAMED_ALLELE_DOSAGE_SUM
HG00096 130 116
HG00097 130 114
HG00099 130 119
HG00100 130 110

Setp 4-2: Combined PRS model

Combined PRS model (python script create by Soyoung Jeon; update by Ying-Chu Lo))

Function:gprs combine-prs

Combine-prs will combine all .sscore files as one .sscore file. And calculate score average and sum per individual.

How to use it?

Shell:

$ gprs combine-prs --ref [str] --result_dur [str] 

Python:

from gprs.gene_atlas_model import GeneAtlasModel

if __name__ == '__main__':
    geneatlas = GeneAtlasModel( ref='1000genomes/hg19',
                    data_dir='data/2014_GWAS_Height' )

    geneatlas.combine_prs(filename="2014height",clump_r2="0.5",clump_kb="250",clump_p1="0.02")

output files

  • *.sscore

id ALLELE_CT SCORE_AVG SCORE_SUM
HG03270 1872.0 -0.00109666512273 -2.2826939078
HG03271 1872.0 -0.00111419935 -2.2831272058
NA19670 1872.0 -0.00117191923182 -2.4014961794
HG03279 1872.0 -0.00115016386364 -2.3057819

Setp 5-1: Calculate statistics value of PRS model

(Rscript create by Soyoung Jeon; update by Ying-Chu Lo)

Function: prs-statistics

After obtained combined sscore file, prs-statistics calculates BETA, AIC, AUC, PseudoR2 and OR ratio

How to use it?

Shell:

$ gprs prs-statistics --score_file [str] --pheno_file [str] --data_set_name [str] --prs_stats_R [str] --r_command [str] --output_name [str] 

Python:

from gprs.gene_atlas_model import GeneAtlasModel

if __name__ == '__main__':
    geneatlas = GeneAtlasModel( ref='1000genomes/hg19',
                    data_dir='data/2014_GWAS_Height' )
    
    geneatlas.prs_statistics(output_name='2014height', score_file = "/home1/ylo40816/Projects/GPRS/tmp/2014height_250_0.02_0.1.sscore",
        pheno_file = "Projects/GPRS/tmp/result/plink/prs/2014height_pheno.csv",
        r_command='/spack/apps/linux-centos7-x86_64/gcc-8.3.0/r-4.0.0-jfy3icn4kexk7kyabcoxuio2iyyww3o7/bin/Rscript',
        prs_stats_R="Projects/GPRS/gprs/prs_stats_quantitative_phenotype.R", data_set_name="2014height",
                             clump_kb='250',
                             clump_p1='0.02',
                             clump_r2='0.1'
                             )

output files

  • *_stat.txt

data filter_condition snps_nb P BETA Degree_of_freedom PseudoR2 OR_top1_to_middle20 OR_top2_to_middle20 OR_top5_to_middle20 OR_top10_to_middle20
2014height 250_0.02_0.1 143 0.5632 0.0019 1007 0.0002 0.5396 1.3899 0.9408 1.2972

Setp 5-2: Combined PRS statistics results (Optional)

Function:combine-prs-stat

If you have more than one trained PRS model, combine-prs-stat function is designed to combine statistics results. For instance: the first PRS model was filtered with P < 0.05, the second PRS model was filtered with P < 0.0005. You will have DATA_0.05_stat.txt/DATA_0.0005_stat.txt Combining two statistic tables allows users easy to compare between PRS models

How to use it?

Shell:

$ gprs combine-prs-stat --data_set_name [str]

Python:

from gprs.gene_atlas_model import GeneAtlasModel

if __name__ == '__main__':
    geneatlas = GeneAtlasModel( ref='1000genomes/hg19',
                    data_dir='data/2014_GWAS_Height' )

    geneatlas.combine_prs_stat(data_set_name='2014height')

output files

  • *_combined_stat.txt

data filter_conditions P BETA PseudoR2 OR_top1_to_middle20 OR_top2_to_middle20 OR_top5_to_middle20 OR_top10_to_middle20
2014height 250_0.02_0.1 0.5632 0.0019 0.0002 0.5396 1.3899 0.9408 1.2972
2014height 250_0.02_0.1 0.5632 0.0019 0.0002 0.5396 1.3899 0.9408 1.2972

Example of giant height from Wood et al 2014

This is an example from Wood et al. The example aimed to use the GPRS package to replicate the Fig4A in Wood et al paper.

Get started: 2014 height data structure

The GWAS summary statistics were downloaded from the GIANT database.

  • Data name: GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt

The data structure:

MarkerName Allele1 Allele2 Freq.Allele1.HapMapCEU b SE p N
rs4747841 A G 0.551 -0.0011 0.0029 0.70 253213
rs4749917 T C 0.436 0.0011 0.0029 0.70 253213
rs737656 A G 0.367 -0.0062 0.0030 0.042 253116
rs737657 A G 0.358 -0.0062 0.0030 0.041 252156
rs7086391 T C 0.12 -0.0087 0.0038 0.024 248425

Which template to use?

Before starting to process the data, we have to choose which template to use. Please use the table below to select the template.

- chr info in the file name chr info not in the file name
chr info in the header gene_atlas_model gwas_model
chr info absent in the header gene_atlas_model gene_atlas_model
  • The chromosome information is absent in 2014 height data, and 2014 height data also has no header with Chromosome information. Thus, I choose gene_atlas_model as a template

Output folders

In the GPRS package, the result folder will automatically generate under the execution directory. i.e. The user run GPRS package in /home/user/ then the default result directory is /home/user/result

The structure of output folders are: result/plink result/qc result/snplists result/stat result/plink/bfiles result/plink/clump result/plink/prs result/plink/qc_and_clump_snpslist

Step1 preparing data set - unified the data format

In step one, the filter_data function will filter raw data with p-value, the default is 1 and gives you three output files: snplist, qc'd file and summary After preparing the data, the snplist: result/snplists/[OUTPUT_NAME].csv will be used to generate plink files.

The qc’d file result/qc/[OUTPUT_NAME].QC.csv is a unified file header to |SNPID| Allele| Beta| SE |Pvalue| and this file will be used in the clumping step. The summary file records the information of the SNPs number before and after data preparation.

$ gprs geneatlas-filter-data --ref [str] --data_dir [str] --result_dir [str] --snp_id_header [str] --allele_header [str] --beta_header [str] --se_header [str] --pvalue_header [str] --pvalue [float/scientific notation] --output_name [str]  
  • In real use:

$ gprs geneatlas-filter-data --data_dir data/2014_GWAS_Height --result_dir [str] --snp_id_header MarkerName --allele_header Allele1 --beta_header b --se_header SE --pvalue_header p --pvalue 1 --output_name 2014height
from gprs.gene_atlas_model import GeneAtlasModel

if __name__ == '__main__':
    geneatlas = GeneAtlasModel( ref='1000genomes/hg19',
                    data_dir='data/2014_GWAS_Height' )
    
    geneatlas.filter_data( snp_id_header='MarkerName',
                            allele_header='Allele1',
                            beta_header='b',
                            se_header ='SE',
                            pvalue_header='p',
                            output_name='2014height')   

After preparing the data-set three files obtained

The output folder will automatically generate and named as result

  • result/snplists/2014height.csv

SNPID
rs737656
rs737657
rs7086391
  • result/qc/2014height.QC.csv

SNPID Allele Beta SE Pvalue
rs737656 A -0.0062 0.003 0.042
rs737657 A -0.0062 0.003 0.041
rs7086391 T -0.0087 0.0038 0.024
  • summary file: /result/qc/[output_name].filteredSNP.withPvalue0.05.summary.txt

no header
2014height: Total SNP BEFORE FILTERING: 2157028
2014height: Total SNP AFTER FILTERING: 395061

Step3.1 remove linked SNPs

In step3, we are trying to find the linked SNPs and remove them for further analysis. The package uses the plink clump function to find the linked SNPs. The r2 value in the clump function is 0.1; if the user wants to apply another value, it should be specified in the command. The function --clump_field is asking the user to enter the column name, the default here is Pvalue (from the Step1 qc'd output )

$ gprs clump --plink_bfile_name [str] --output_name [str] --clump_kb [int] --clump_p1 [float/scientific notation] --clump_p2 [float/scientific notation] --clump_r2 [float] --clump_field [str] --qc_file_name [str] --clump_snp_field [str]   
  • In real use:

$ gprs clump --data_dir data/2014_GWAS_Height --clump_kb 250 --clump_p1 0.02 --clump_p2 0.02 --clump_r2 0.1 --clump_field Pvalue --clump_snp_field 2014height --plink_bfile_name 2014height --qc_file_name 2014height --output_name 2014height
from gprs.gene_atlas_model import GeneAtlasModel

if __name__ == '__main__':
    geneatlas = GeneAtlasModel( ref='1000genomes/hg19',
                    data_dir='data/2014_GWAS_Height' )
    
    geneatlas.clump(output_name='2014height',plink_bfile_name='2014height',
                    qc_file_name='2014height',clump_kb='250',
                    clump_p1='0.02',clump_p2='0.02', clump_r2='0.02')
  • chr1-chr22 clumped files obtained

result/plink/clump/*.clumped

CHR F SNP BP P TOTAL NSIG S05 S01 S001 S0001 SP2
1 1 rs1967017 145723645 3.72e-16 19 0 2 6 3 8 rs11590105(1),rs17352281(1),rs9728345(1),rs11587821(1)
1 1 rs760077 155178782 7.45e-10 12 0 2 2 1 7 rs11589479(1),rs3766918(1),rs4625273(1),rs4745(1),rs12904(1)

Step3.2 select clumped SNPs

After clump, we will receive a list of SNPs, and we filter out the original qc’d file to generate a new SNPs list. From this step, users have to provide C+T (clumping + threshold) conditions as a marker in the output file name.

$ gprs select-clump-snps --qc_file_name [str] --clump_file_name [str] --output_name [output name] --clump_kb [int] --clump_p1 [float/scientific notation] --clump_r2 [float] --clumpfolder_name [str]
  • In real use:

$ gprs select-clump-snps --qc_file_name 2014height --clump_file_name 2014height --clump_kb 250 --clump_p1 0.02 --clump_r2 0.1 --clumpfolder_name 2014height --output_name 2014height
from gprs.gene_atlas_model import GeneAtlasModel

if __name__ == '__main__':
    geneatlas = GeneAtlasModel( ref='1000genomes/hg19',
                    data_dir='data/2014_GWAS_Height' )
    
    geneatlas.select_clump_snps(output_name='2014height',
                                clump_file_name='2014height',
                                qc_file_name='2014height',
                                clump_kb='250',
                                clump_p1='0.02',
                                clump_r2='0.5',clumpfolder_name='2014height')
  • new snps list obtained

result/plink/clump/*.qc_clump_snpslist.csv

CHR SNP
1 rs1967017
1 rs760077

Step4.1 Generate PRS model

In step4 the function build-prs is built on Plink2.0 dosage.

$ gprs build-prs --vcf_input [str] --output_name [str] --qc_clump_snplist_foldername [str] --memory [int] --clump_kb [int] --clump_p1 [float/scientific notation] --clump_r2 [float] --symbol [str/int] --columns [int] --plink_modifier [str]  
  • In real use:

$ gprs build-prs --vcf_input 1000genomes/hg19 --symbol . --qc_file_name 2014height --columns 1 2 3 --memory 1000 --clump_kb 250 --clump_p1 0.02 --clump_r2 0.1 --output_name 2014height 
from gprs.gene_atlas_model import GeneAtlasModel

if __name__ == '__main__':
    geneatlas = GeneAtlasModel( ref='1000genomes/hg19',
                    data_dir='data/2014_GWAS_Height' )
    
    geneatlas.build_prs( vcf_input= 'home/1000genomes/hg19',
                         output_name ='2014height', memory='1000',clump_kb='250',
                         clump_p1='0.02', clump_r2='0.02', qc_clump_snplist_foldername='2014height')
  • chr1-chr22 sscore files obtained

result/plink/prs/*.sscore

IID NMISS_ALLELE_CT NAMED_ALLELE_DOSAGE_SUM
HG00096 1912 889
HG00097 1912 884
HG00099 1912 875

Step4.2 Combined PRS model

From step 4.1 we will have 1-22 chromosomes sscore files, in this step we are going to combine these files as one output.

$ gprs build-prs --vcf_input [str] --symbol [str/int] --qc_file_name [str] --columns [int] --plink_modifier [str] --memory [int] --clump_kb [int] --clump_p1 [float/scientific notation] --clump_r2 [float] --output_name [output name] --clump_kb [int] --clump_p1 [float/scientific notation] --clump_r2 [float]
  • In real use:

$ gprs build-prs --vcf_input 1000genomes/hg19 --symbol . --qc_file_name 2014height --columns 1 2 3 --memory 1000 --clump_kb 250 --clump_p1 0.02 --clump_r2 0.1 --output_name 2014height 
from gprs.gene_atlas_model import GeneAtlasModel

if __name__ == '__main__':
    geneatlas = GeneAtlasModel( ref='1000genomes/hg19',
                    data_dir='data/2014_GWAS_Height' )
    
    geneatlas.combine_prs(filename="2014height",
                          clump_r2="0.5",clump_kb="250",clump_p1="0.02")
  • one combined sscore files obtained

result/plink/prs/*.sscore

id ALLELE_CT SCORE_AVG SCORE_SUM
HG00096 59636.0 0.0009896786363636364 60.676823334000005
HG00097 59636.0 0.0009901211363636362 60.46440781399999
HG00099 59636.0 0.0010224948181818182 63.233378824

Visualize distribution of PRS in each population

# Libraries
library(ggplot2)
library(dplyr)

data <- read.csv("combine_profil_w_pop.txt", sep=" ")

data$SCORE_Z <- (data$SCORE-mean(data$SCORE))/sd(data$SCORE) 

ggplot(data, aes(x=data$SCORE_Z, group=data$super_pop, fill=data$super_pop)) +
  geom_density(adjust=1.25, alpha=.7) +
  scale_fill_manual(values=c("brown2","springgreen3","mediumorchid","deepskyblue3","orange"))+
  labs(x = "Polygenic Score", y = "Density", fill = "Super population")+
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

PRS distribution plot

Example of MEC American Africans

This is an example of MEC American Africans

Under construction

GPRS

GeneAtlasModel

GwasModel

Indices and tables