Get started: understanding the data format

GWAS template

Chr Pos RSID Allele1 Allele2 Freq1 Effect StdErr P-value n_total_sum
1 10539 rs537182016 a c 0.0013 -5.1213 20.0173 0.7981 7043
1 11008 rs575272151 c g 0.9215 0.1766 0.2610 0.4985 7042.99
1 11012 rs544419019 c g 0.9215 0.1766 0.2610 0.4985 7042.99
1 14674 rs561913721 a g 0.0032 0.8040 0.8364 0.3364 7043

GeneAtlas template

SNP ALLELE NBETA-selfReported_n_1526 NSE-selfReported_n_1526 PV-selfReported_n_1526
rs1110052 T -0.00032386 0.000269 0.2286
rs112164716 T 6.5121e-05 0.001126 0.95388
rs11240779 A -0.00022416 0.00028937 0.43855
rs11260596 C 0.00025168 0.00024248 0.29931

How to choose a model template?

-                             | chr info in the file name | chr info not in the file name
chr info in the header        | gene_atlas_model          | gwas_model
chr info absent in the header | gene_atlas_model          | gene_atlas_model
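
The table above can be read as a small decision rule. The following is a minimal sketch of that rule in plain Python. `header_has_chr` and `choose_model` are hypothetical helper names, not part of the gprs package, and the header check assumes a whitespace-separated header with a column literally named `Chr` (case-insensitive), as in the GWAS template.

```python
def header_has_chr(header_line):
    """Return True if a whitespace-separated header has a chromosome column.

    Assumption: the column is named 'Chr' (any letter case), as in the
    GWAS template shown earlier.
    """
    columns = [name.lower() for name in header_line.split()]
    return "chr" in columns


def choose_model(chr_in_header, chr_in_file_name):
    """Map the two yes/no questions from the table to a model name."""
    if chr_in_header and not chr_in_file_name:
        return "gwas_model"
    # Every other combination in the table points to gene_atlas_model.
    return "gene_atlas_model"


gwas_header = "Chr Pos RSID Allele1 Allele2 Freq1 Effect StdErr P-value n_total_sum"
print(choose_model(header_has_chr(gwas_header), chr_in_file_name=False))
```

With the GWAS template header and no chromosome information in the file name, the sketch selects `gwas_model`, matching the top-right cell of the table.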

Before step1:

Please decompress your .gz file first (for example, with gunzip).
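
If you prefer to do the decompression from Python, a minimal sketch using only the standard library could look like the following. `gunzip` is a hypothetical helper name and is not provided by gprs; it writes the decompressed file next to the original and leaves the .gz file in place.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(gz_path):
    """Decompress `gz_path` next to the original and return the new path."""
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix("")  # drops the trailing ".gz"
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # Stream the data so large summary-statistics files need not fit in memory.
        shutil.copyfileobj(src, dst)
    return out_path
```

Calling `gunzip` on a path ending in `.csv.gz` produces a file with the same name ending in `.csv` in the same directory.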

Step1: Unify the data format

Once you know the data format, choose the model (gwas or geneatlas) to unify the data format and, optionally, filter out SNPs. :heavy_exclamation_mark: SNPs are extracted by RSID, not by chromosome position.

Function: gprs geneatlas-filter-data

Filter a GeneAtlas csv file by P-value and unify the data format in the following order: SNPID, ALLELE, BETA, StdErr, Pvalue
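
To make the filtering and reordering concrete, here is a minimal sketch in plain Python. It is not the gprs implementation: `filter_and_unify` is a hypothetical function, and it assumes whitespace-separated input and that rows are kept when their P-value is less than or equal to the threshold. The parameter names mirror the command-line options for readability only.

```python
def filter_and_unify(lines, snp_id_header, allele_header, beta_header,
                     se_header, pvalue_header, pvalue=1.0):
    """Keep rows with P-value <= `pvalue` and reorder the columns.

    Assumptions (not taken from gprs): whitespace-separated fields and an
    inclusive threshold.
    """
    header = lines[0].split()
    wanted = [snp_id_header, allele_header, beta_header, se_header, pvalue_header]
    idx = [header.index(name) for name in wanted]
    p_col = idx[-1]  # position of the P-value column in the input
    rows = [["SNPID", "ALLELE", "BETA", "StdErr", "Pvalue"]]
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if float(fields[p_col]) <= pvalue:
            rows.append([fields[i] for i in idx])
    return rows


# Rows copied from the GWAS template shown earlier.
sample = [
    "Chr Pos RSID Allele1 Allele2 Freq1 Effect StdErr P-value n_total_sum",
    "1 10539 rs537182016 a c 0.0013 -5.1213 20.0173 0.7981 7043",
    "1 11008 rs575272151 c g 0.9215 0.1766 0.2610 0.4985 7042.99",
    "1 14674 rs561913721 a g 0.0032 0.8040 0.8364 0.3364 7043",
]
unified = filter_and_unify(sample, "RSID", "Allele1", "Effect",
                           "StdErr", "P-value", pvalue=0.5)
for row in unified:
    print(" ".join(row))
```

With a threshold of 0.5, the row for rs537182016 (P-value 0.7981) is dropped and the remaining two rows are emitted with only the five unified columns.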

How to use it?

Shell:

$ gprs geneatlas-filter-data --ref [str] --data_dir [str] --result_dir [str] --snp_id_header [str] --allele_header [str] --beta_header [str] --se_header [str] --pvalue_header [str] --pvalue [float/scientific notation] --output_name [str]
$ gprs gwas-filter-data --ref [str] --data_dir [str] --result_dir [str] --snp_id_header [str] --allele_header [str] --beta_header [str] --se_header [str] --pvalue_header [str] --pvalue [float/scientific notation] --output_name [str]

Python:

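# GeneAtlas model example
from gprs.gene_atlas_model import GeneAtlasModel

if __name__ == '__main__':
    # ref and data_dir are paths to the reference data and the input data directory.
    geneatlas = GeneAtlasModel(ref='1000genomes/hg19',
                               data_dir='data/2014_GWAS_Height')

    # Each *_header argument is the column name used in the input file.
    geneatlas.filter_data(snp_id_header='MarkerName',
                          allele_header='Allele1',
                          beta_header='b',
                          se_header='SE',
                          pvalue_header='p',
                          output_name='2014height')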
   
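# GWAS model example
from gprs.gwas_model import GwasModel

if __name__ == '__main__':
    # ref and data_dir are paths to the reference data and the input data directory.
    gwas = GwasModel(ref='/home1/ylo40816/1000genomes/hg19',
                     data_dir='/home1/ylo40816/Projects/GPRS/data/2019_GCST008970')

    # Each *_header argument is the column name used in the input file;
    # file_name is the input file to process.
    gwas.filter_data(snp_id_header='RSID',
                     allele_header='Allele1',
                     beta_header='Effect',
                     se_header='StdErr',
                     pvalue_header='P-value',
                     output_name='GCST008970',
                     file_name='gout_chr1_22_LQ_IQ06_mac10_all_201_rsid.csv')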

Output files

  • *.QC.csv (QC files)

  • *.csv (snplist)