Supplementary data for the paper:

Whole-genome bisulfite sequencing of multiple individuals reveals the quantitative roles of DNA methylation in transcriptional regulation

Shaoke Lou 1,2, Hao Qin 2, Jing-Woei Li 2, Zhibo Gao 3, Xin Liu 3, Landon L. Chan 1,2,
Vincent K. L. Lam 4,5,6, Heung-Man Lee 4,5,6, Wing-Yee So 4,5,6, Ying Wang 4,5,6,
Si Lok 6, Jun Wang 3, Ronald C. W. Ma 4,5,6, Stephen Kwok-Wing Tsui 7,8,9,
Juliana C. N. Chan 4,5,6, Ting-Fung Chan 2,8,9, and Kevin Y. Yip 1,8,9

1 Department of Computer Science and Engineering,
2 School of Life Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
3 Beijing Genomics Institute (BGI)-Shenzhen, Shenzhen, China
4 Department of Medicine and Therapeutics,
5 Hong Kong Institute of Diabetes and Obesity,
6 Li Ka Shing Institute of Health Sciences,
7 School of Biomedical Sciences,
8 Hong Kong Bioinformatics Centre,
9 CUHK-BGI Innovation Institute of Trans-omics,
The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong

This folder contains two directories.
raw/ -- the annotation file and raw data used to extract methylation and expression features

trio.hg19.cout.nchrM.sort.bed: the methylome data of the trio family. The data were lifted from hg18 to hg19
                               and are sorted by chromosome and start position.
                               The format is: chr, start (0-based), end, methylation, where methylation is
                               represented by a string that matches the regular expression
                               (\d+)_(0|1)_(0|1)_(0|1)_(0|1). Group 1 of the regular expression is the
                               position (1-based) in the hg18 genome; groups 2, 3 and 4 are the methylation
                               status of the trio father, mother and daughter, where '1' means
                               methylated and '0' means unmethylated; the last group (5) gives the strand,
                               where '1' is the plus strand and '0' is the minus strand.
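
As an illustration of the methylation string layout described above, here is a minimal Python
sketch that splits one record into its parts. It assumes the columns are separated by
whitespace (usual for BED files, but not stated here). The names MethylationRecord and
parse_methylome_line, and the sample line in the usage note, are made up for this example.

```python
import re
from typing import NamedTuple

# Regular expression quoted from the format description: five capture groups.
METHYLATION_RE = re.compile(r"(\d+)_(0|1)_(0|1)_(0|1)_(0|1)")


class MethylationRecord(NamedTuple):
    chrom: str
    start: int       # 0-based start on hg19
    end: int
    hg18_pos: int    # 1-based position on hg18 (group 1)
    father: bool     # group 2: True means methylated
    mother: bool     # group 3
    daughter: bool   # group 4
    strand: str      # group 5: '1' -> '+', '0' -> '-'


def parse_methylome_line(line: str) -> MethylationRecord:
    """Parse one line of the methylome file (assumed whitespace-delimited)."""
    chrom, start, end, meth = line.split()[:4]
    match = METHYLATION_RE.fullmatch(meth)
    if match is None:
        raise ValueError(f"unexpected methylation field: {meth!r}")
    pos, father, mother, daughter, strand = match.groups()
    return MethylationRecord(
        chrom, int(start), int(end), int(pos),
        father == "1", mother == "1", daughter == "1",
        "+" if strand == "1" else "-",
    )
```

For example, the made-up line "chr1<TAB>10468<TAB>10469<TAB>468_1_0_1_1" would be read as
methylated in the father and daughter, unmethylated in the mother, on the plus strand.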

(f|m|d).sort.sam.20120616t1.exp: expression levels from the RNA-Seq data for the trio family. The format is: id,
                               gene_id, gene_name, chromosome, gene_length, reads_count, rpkm. 'f' means
                               father, 'm' means mother, and 'd' means daughter.
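
The seven columns listed above could be read with a sketch like the following. It assumes the
file is tab-delimited with no header line, that gene_length is an integer, and that
reads_count and rpkm are numeric; none of this is stated here, so check against the actual
files. ExpressionRecord and parse_expression_line are hypothetical names.

```python
from typing import NamedTuple


class ExpressionRecord(NamedTuple):
    record_id: str
    gene_id: str
    gene_name: str
    chromosome: str
    gene_length: int
    reads_count: float
    rpkm: float


def parse_expression_line(line: str) -> ExpressionRecord:
    """Parse one line of a .exp file (assumed tab-delimited, 7 columns)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}")
    record_id, gene_id, gene_name, chromosome, length, reads, rpkm = fields
    return ExpressionRecord(
        record_id, gene_id, gene_name, chromosome,
        int(length), float(reads), float(rpkm),
    )
```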

gencode_v7.level12.sort.ucsc: annotation file in UCSC BED format, which includes the level 1 and 2 genes
                              from the Gencode version 7 annotation. The composite gene structure is
                              built as described in Methods.

processed/ -- .arff files for Weka.

merge(M|MG|ML|MGL).121024.5sample.arff: merged dataset includes: trio-father (fam_id: 0), trio-mother (fam_id: 1),
                              trio-daughter (fam_id: 2), H1 (fam_id: 3), IMR90 (fam_id: 4). 'M' means mCG,
                              'MG' means mCG/CG, 'ML' means mCG/Length, 'MGL' means mCG/CG/Length
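
For convenience, the fam_id codes and feature-set labels above can be written down as lookup
tables, as in this small Python sketch. It assumes that "(M|MG|ML|MGL)" in the file name
stands for one file per feature set; the names FAM_ID_TO_SAMPLE, FEATURE_SETS and
merged_arff_names are made up for this example.

```python
# fam_id codes as listed in the description of the merged dataset.
FAM_ID_TO_SAMPLE = {
    0: "trio-father",
    1: "trio-mother",
    2: "trio-daughter",
    3: "H1",
    4: "IMR90",
}

# Feature-set codes used in the file names and what they mean.
FEATURE_SETS = {
    "M": "mCG",
    "MG": "mCG/CG",
    "ML": "mCG/Length",
    "MGL": "mCG/CG/Length",
}


def merged_arff_names():
    """List the merged .arff file names, assuming one file per feature set."""
    return [f"merge{code}.121024.5sample.arff" for code in FEATURE_SETS]
```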

trio(M|MG|ML|MGL).newclass.121024.arff: trio dataset includes: trio-father (fam_id: 0), trio-mother (fam_id: 1)
