Tagging SNPs and estimating haplotypes based on bovine SNP chip data

This project is carried out by Adrian Drożdż

Supervisor: Joanna Szyda

The profile of work

Single Nucleotide Polymorphisms (SNPs) are considered almost perfect markers for genetic traits and for whole-genome scans (Genome Wide Scans, GWS). They are used as direct or indirect gene markers, depending on the approach. Genome-wide association treats the set (and pattern) of SNPs in the genome as a representative marker set for one or more traits. The second approach, genetic linkage, treats a gene or genes as represented by a marker (one or a few SNPs) because of Linkage Disequilibrium (LD).

The number of genotyped SNPs is huge; in humans, for example, it is approaching 4 million, depending on the population analyzed.

Because there are so many SNPs, there is a strong need to reduce their number. This can be done in a few reduction steps:

  • use only common SNPs - when several populations differ in the number and loci of their SNPs, the SNPs that occur in every population are the common ones,
  • remove SNPs whose Minor Allele Frequency (MAF) is less than a specified level,
  • remove from the set markers that are in high Linkage Disequilibrium with each other, keeping only representative (tag) SNPs.
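The first two reduction steps can be sketched in a few lines of Python. This is only an illustration: the genotype coding (two-character strings, with "N" marking a missing call), the function names and the default threshold of 0.05 are assumptions made for the example, and it assumes that SNPs with a MAF below the threshold are the ones discarded.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return the MAF of one biallelic SNP.

    genotypes: list of two-character strings such as "AA", "AG", "GG".
    Entries containing "N" are treated as missing (hypothetical coding).
    """
    alleles = Counter()
    for genotype in genotypes:
        if "N" in genotype:
            continue  # skip missing calls
        alleles.update(genotype)  # counts each of the two allele characters
    total = sum(alleles.values())
    if total == 0 or len(alleles) < 2:
        return 0.0  # fully missing or monomorphic SNP
    return min(alleles.values()) / total


def common_snps(populations):
    """Return the IDs of SNPs that are present in every population.

    populations: dict mapping population name -> dict of snp_id -> genotypes.
    """
    id_sets = [set(snps) for snps in populations.values()]
    return set.intersection(*id_sets) if id_sets else set()


def filter_by_maf(snps, threshold=0.05):
    """Keep only SNPs whose MAF is at least the threshold (assumed value)."""
    return {
        snp_id: genotypes
        for snp_id, genotypes in snps.items()
        if minor_allele_frequency(genotypes) >= threshold
    }
```

Taking the intersection of SNP IDs first and filtering by MAF afterwards keeps the two criteria independent, so either can be switched off or reordered.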

Crossing over is a process that changes the parental pattern of the DNA strands in chromosomes. Some chromosome segments are passed together from generation to generation without crossing over, or with only a small probability that crossing over occurs within them. SNPs in such segments are considered to be in strong LD, that is, their alleles occur together more often than expected under independence (linkage equilibrium). If several SNPs are in LD, one can mark (tag) one or a few SNPs in this subset and remove the other SNPs of the high-LD subset, to obtain the smallest possible number of SNPs.

This process can remove a large number of uninformative SNPs. The reduction depends on the dataset, on the LD (crossing-over) pattern and, of course, on the algorithm used for the reduction.
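The strength of LD between two SNPs is commonly measured by r². With p_A and p_B the frequencies of one allele at each SNP and p_AB the frequency of haplotypes carrying both, D = p_AB - p_A * p_B and r² = D² / (p_A (1 - p_A) p_B (1 - p_B)). The sketch below computes it from phased haplotypes; the 0/1 allele coding, the function name and the choice to report 0 for a monomorphic SNP are assumptions made for the example.

```python
def r_squared(hap_a, hap_b):
    """Return r^2 between two biallelic SNPs.

    hap_a, hap_b: alleles (coded 0/1) of the two SNPs on the same
    phased haplotypes, in the same order.
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and of equal length")
    p_a = sum(hap_a) / n  # frequency of allele 1 at the first SNP
    p_b = sum(hap_b) / n  # frequency of allele 1 at the second SNP
    # frequency of haplotypes carrying allele 1 at both SNPs
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        return 0.0  # a monomorphic SNP: r^2 is undefined, reported here as 0
    d = p_ab - p_a * p_b
    return d * d / denominator
```

Two SNPs that always carry the same (or always the opposite) allele on a haplotype give r² = 1, so either of them can stand in for the other as a tag.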

One can create other selection criteria:

  • consider the parts of the genome that cover genes or gene-rich regions,
  • include DNA features such as enhancers, exons and promoters in the analysis together with SNPs,
  • other criteria are considered in this work.
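A region-based criterion of this kind amounts to an interval filter on SNP positions. The following is a minimal sketch for a single chromosome; the data layout, the function name and the optional flank parameter are hypothetical and serve only to illustrate the idea.

```python
def snps_in_regions(snp_positions, regions, flank=0):
    """Return the IDs of SNPs that fall inside any region of interest.

    snp_positions: dict mapping snp_id -> base-pair position (one chromosome).
    regions: list of (start, end) inclusive intervals, e.g. genes or exons.
    flank: number of base pairs by which each region is extended on both
           sides (hypothetical parameter, e.g. to include promoters).
    """
    return {
        snp_id
        for snp_id, position in snp_positions.items()
        if any(start - flank <= position <= end + flank for start, end in regions)
    }
```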

The aim of the project

  • To develop a set of programs for tagSNP evaluation and selection.
  • To evaluate and select tagSNPs from a bovine dataset.

Material

HapMap data analysis

Human genetics is much better understood than bovine genetics, and coverage of the bovine genome is not yet satisfactory. That is why the first aim of the project (to develop a set of programs) needs to be carried out on human data. HapMap is an international project based on four populations from four parts of the world, representing three races. For more details, please visit the project homepage (HapMap.org).

HapMap data:

  1. Populations:
    • CEU - Utah residents with Northern and Western European ancestry
    • CHB - Han Chinese in Beijing
    • JPT - Japanese in Tokyo
    • YRI - Yoruba in Ibadan, Nigeria

    see HapMap.org for more information

  2. Dataset:
    • HapMap LD data for whole chromosomes and parts of chromosomes
    • HapMap genotyped SNPs for whole chromosomes and parts of chromosomes

Bovine data analysis

At present it is not possible to obtain data from a bovine SNP chip. Even if it were, the coverage is poor (a few percent of the whole genome). Until real bovine data become available, simulated data will be analyzed.
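How the simulated data are produced is not specified here. As one possible illustration, the sketch below builds each haplotype as a mosaic of a few random founder haplotypes, switching founder between adjacent SNPs with a small probability, which yields LD that decays with distance. The model, the function names and all parameter values are assumptions made for the example, not the simulation used in the project.

```python
import random


def simulate_haplotypes(n_haplotypes, n_snps, n_founders=4, switch_prob=0.05, seed=0):
    """Simulate 0/1 haplotypes as mosaics of random founder haplotypes.

    switch_prob plays the role of a per-interval crossing-over probability
    (toy model; all defaults are arbitrary example values).
    """
    rng = random.Random(seed)  # seeded for reproducible output
    founders = [
        [rng.randint(0, 1) for _ in range(n_snps)] for _ in range(n_founders)
    ]
    haplotypes = []
    for _ in range(n_haplotypes):
        current = rng.randrange(n_founders)  # founder copied at the first SNP
        haplotype = []
        for j in range(n_snps):
            if j > 0 and rng.random() < switch_prob:
                current = rng.randrange(n_founders)  # simulated crossing over
            haplotype.append(founders[current][j])
        haplotypes.append(haplotype)
    return haplotypes


def genotypes_from_haplotypes(haplotypes):
    """Pair consecutive haplotypes into diploid genotypes coded 0/1/2.

    A trailing unpaired haplotype is dropped.
    """
    return [
        [a + b for a, b in zip(haplotypes[i], haplotypes[i + 1])]
        for i in range(0, len(haplotypes) - 1, 2)
    ]
```

A lower switch_prob produces longer shared segments and therefore stronger LD between neighbouring SNPs, which is useful for checking how a tagging program reacts to different LD patterns.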

Methods

Steps:

  1. to use algorithms that choose tagSNPs based on an LD r² threshold, implemented in:
    • TAGster
    • FESTA
    • HaploView

  2. to use other approaches
    • R packages
    • other

  3. to write our own algorithms and programs or modify existing ones
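The general idea behind selection by an r² threshold can be sketched as a greedy procedure: repeatedly pick the SNP that is in high LD (r² at or above the threshold) with the largest number of still untagged SNPs, and mark all of those as tagged. The code below is a simplified illustration of that idea, not a reimplementation of any of the programs listed above; the input layout, the function name and the default threshold of 0.8 are assumptions.

```python
def greedy_tag_snps(snp_ids, r2, threshold=0.8):
    """Select tag SNPs greedily from pairwise r^2 values.

    snp_ids: list of SNP identifiers.
    r2: dict mapping frozenset({id1, id2}) -> r^2; missing pairs count as 0.
    threshold: minimum r^2 for one SNP to tag another (assumed default).
    Returns the tag SNPs in the order in which they were chosen.
    """
    def covered_by(snp):
        # A SNP always covers itself, plus every SNP in high LD with it.
        return {snp} | {
            other
            for other in snp_ids
            if other != snp and r2.get(frozenset((snp, other)), 0.0) >= threshold
        }

    neighbours = {snp: covered_by(snp) for snp in snp_ids}
    untagged = set(snp_ids)
    tags = []
    while untagged:
        # Pick the untagged SNP covering the most untagged SNPs;
        # sorting makes ties resolve to the smallest ID.
        best = max(sorted(untagged), key=lambda s: len(neighbours[s] & untagged))
        tags.append(best)
        untagged -= neighbours[best]
    return tags
```

Raising the threshold makes the tags more faithful to the SNPs they replace but increases the number of tags needed; with a threshold above every pairwise r², each SNP ends up tagging only itself.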

Summary

The developed tools will be able to handle the analysis of bovine data on Unix systems smoothly and automatically. They will provide configurable parameters, accept a wide range of input datasets and produce informative output.

Appendix

PCs used for analysis:

  1. AMD Phenom desktop server (24h/d)
    • 4 cores AMD PHENOM 9500 2.32 GHz
    • 4 GB RAM 1120 MHz
    • 4 drives (Linux software RAID0, RAID5, RAID1)

  2. Intel Xeon server (24h/d)
    • 2 x 2 cores Intel Xeon 2.2 GHz
    • 6 GB RAM 667 MHz
    • 2 drives RAID0

  3. Intel Core 2 Duo desktop server (24h/d)
    • 2 cores Intel E6600 2.3 GHz
    • 8 GB RAM 800 MHz

update 2008