Statistics Department
Purdue University
simonsen@purdue.edu
Brief Project Description:
In population genetics, the coalescent process is an important model by which
the variability of DNA sequence data can be understood. Coalescent models
incorporating genetic recombination have for the last 20 years played an
important role in understanding the effect of linkage on genetic variability
in natural populations, both theoretically and via simulation. For example,
coalescence with recombination can be used to simulate the SNP marker data
used to detect association with diseases and traits in humans and other
non-experimental populations. However, simulation with such models (including
that of Simonsen and Churchill, 1997) has suffered from a common problem:
the computational complexity (computer time and memory needed) increases
exponentially with the number of genetic loci involved, and with the population
size and recombination rate. Thus such simulations have been limited to small
numbers of loci encompassing small regions of the genome. This motivates the
development of a much more efficient computer algorithm for such simulations,
whose complexity is only polynomial in the parameters. I will describe the
special structure of the model that made such efficiency possible, and give
some timing results to show that the desired efficiency has been achieved.
This new algorithm will enable the simulation of genetic data on a genome-wide
scale.
Professor Simonsen's program performed these basic operations. However, it
was very limited: because of the huge amount of memory it required, large
simulations were not possible. As a result, she contacted Chinh Le,
Dan Noland, and Faisal Saied in RCS for help. They made a number of
improvements to optimize the program.
Collectively, these improvements allow the program to run simulations two orders of magnitude larger than was previously possible, and to run smaller test cases considerably faster.
Currently the group is working to reduce the amount of memory necessary to represent the trees.