I have enjoyed defining myself as a big data analyst since my PhD in Computational Algebra (Mathematics). My interest in the fields of human biostatistics and bioinformatics began with research at DeCODE Genetics, a human genetics research institute in Reykjavik, Iceland, and continued with genomics research in obesity and liver disease at George Mason University and INOVA Fairfax Hospital, as well as at the genotyping facility at the Boston University School of Medicine. My positions at the National Center for Genome Research and the Virginia Bioinformatics Institute prepared me to perform critical analyses of very large datasets from a variety of organisms. Currently at the University of Nevada, Reno, as the Director of the Bioinformatics Center, my focus is to develop new and robust mathematical and (bio)statistical tools to analyze large whole-genome datasets for researchers state-wide, including genome-wide association studies (GWAS), next-generation sequencing experiments, and mass spectrometry studies.
Our lab focuses on developing new and robust mathematical tools to analyze large whole-genome data sets generated on a variety of platforms.
Because expression data contain significant amounts of random variation, and because the clusters obtained depend on the procedure applied, it is useful to assign confidence measures to clusters. We have implemented an algorithm in the statistical programming language R that assigns confidence measures to groupings of genes produced by clustering routines. Using permutation testing and convex hull methods, we simulate pseudo-random gene expression data sets; statistics computed from these randomly generated sets provide a basis for comparison with the original data.
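The core of this idea can be sketched briefly. The actual implementation is in R; the following is a simplified Python sketch under assumed details, where a naive k-means stands in for the clustering routine under test and plain within-row permutation stands in for the full permutation/convex-hull simulation (all function names are illustrative):

```python
import numpy as np

def within_ss(X, labels):
    # total within-cluster sum of squares: tightness of a grouping
    return sum(((X[labels == j] - X[labels == j].mean(0)) ** 2).sum()
               for j in np.unique(labels))

def kmeans(X, k, iters=30, restarts=5, seed=0):
    # naive k-means with restarts; stands in for the clustering routine
    best = None
    for r in range(restarts):
        rng = np.random.default_rng(seed + r)
        centers = X[rng.choice(len(X), k, replace=False)]
        for _ in range(iters):
            labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
            centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if best is None or within_ss(X, labels) < within_ss(X, best):
            best = labels
    return best

def cluster_confidence(X, k=2, n_perm=200, seed=0):
    # compare the observed cluster tightness against clusterings of
    # randomly generated (here: row-permuted) expression data
    rng = np.random.default_rng(seed)
    observed = within_ss(X, kmeans(X, k))
    null = []
    for _ in range(n_perm):
        # permute each gene's values across conditions to break structure
        Xp = np.apply_along_axis(rng.permutation, 1, X)
        null.append(within_ss(Xp, kmeans(Xp, k)))
    # p-value: how often random data clusters as tightly as the real data
    return (1 + sum(s <= observed for s in null)) / (1 + n_perm)
```

A small p-value indicates that the observed clusters are far tighter than anything the randomly generated data sets produce, i.e., that the grouping is unlikely to be an artifact of noise.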
The analysis of big data is a significant challenge for the researcher. The parallel assay of thousands of data points, not all of which are independent, across a number of states or conditions provides an interesting platform for statistical analyses and the construction of models. Although standard hierarchical clustering techniques can be applied to these data, no standard tools exist for turning the resulting groupings into functional hypotheses. We have developed a graph-theoretic approach for constructing putative functional network models that suggest hypotheses about the functions of unknown genes, and we have applied it to several current experiments with promising results. An innovative distance metric is under development to measure the similarity between any pair of genes in a more biologically grounded manner than commonly used distance metrics. Using these similarity relations, an undirected graph is generated by connecting genes according to their degree of similarity, and "clusters" are detected within the structure of the graph's connectivity. These clusters provide hypotheses of gene function and interaction, and guide the association of genes with changes in the biochemical pathways involved in the stress responses and adaptive mechanisms of the organism under study. An ongoing study also focuses on post-analysis findings and the biological meaning behind clusters, an often-neglected step in expression data analysis. We are also comparing these methods to common co-expression network tools.
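The graph-construction step can be sketched in a few lines. This is not our production code: Pearson correlation stands in here for the similarity metric still under development, the clusters are simply connected components of the thresholded graph, and all names are illustrative:

```python
import numpy as np
from collections import defaultdict

def coexpression_clusters(X, threshold=0.9):
    # X: genes x conditions expression matrix.
    # Pairwise Pearson correlation stands in for the similarity metric.
    C = np.corrcoef(X)
    n = len(X)
    # connect genes whose similarity exceeds the threshold
    adj = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n):
            if C[i, j] >= threshold:
                adj[i].append(j)
                adj[j].append(i)
    # clusters = connected components of the similarity graph
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, stack = [], [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        clusters.append(sorted(comp))
    return clusters
```

Each component is a candidate functional module: genes of unknown function inherit hypotheses from the annotated genes they are connected to.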
Complex networks are often used to model hierarchical social, biological or communication systems, as well as genetic systems. As a first approximation, Boolean networks are often used. As part of my research at the Virginia Bioinformatics Institute with Professor Reinhard Laubenbacher, we developed a method of encoding a Boolean network as a collection of simplicial complexes. We also established a combinatorial analogue of the homotopy theory of topological spaces to analyze these simplicial complexes. The resulting combinatorial invariants provide information on the dynamics of the network. By representing genetic relationships via (Boolean) network structures, applications of combinatorial homotopy theory may reveal overall network behavior and patterns of influence within and across gene subgroups.
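The simplicial encoding and its homotopy invariants are beyond a short example, but the underlying object is simple: each node carries a Boolean update rule, and the network's dynamics are the trajectories of synchronous updates. A hedged Python sketch, with an invented three-node network for illustration, enumerating the attractors exhaustively:

```python
from itertools import product

def step(state, rules):
    # synchronously apply each node's Boolean update rule
    return tuple(rule(state) for rule in rules)

def attractors(rules, n):
    # follow every one of the 2^n states until a state repeats;
    # exact, though feasible only for small n
    found = set()
    for start in product((0, 1), repeat=n):
        seen = {}
        s = start
        while s not in seen:
            seen[s] = len(seen)
            s = step(s, rules)
        i = seen[s]  # first index of the repeated state: cycle entry point
        cycle = tuple(sorted(k for k, v in seen.items() if v >= i))
        found.add(cycle)
    return found

# toy network (invented): x1' = x2 AND x3, x2' = x1, x3' = NOT x1
rules = [lambda s: s[1] & s[2], lambda s: s[0], lambda s: 1 - s[0]]
```

The attractors (fixed points and limit cycles) are exactly the long-run behaviors whose structure the combinatorial invariants are meant to capture.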
A false-color heatmap of the intensity levels of a two-color cDNA microarray is generated for each channel, and for the background-corrected ratio values. This image allows the user to quickly determine whether any spatial variation appears on the array, or whether control spots are behaving as predicted. The tool applies equally to high-density oligonucleotide arrays, such as those made by Affymetrix and NimbleGen. This technique provides the researcher with a bird's-eye view of each array in the experiment. The software is written in the R programming language and is simple to install and use.
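The arrangement step behind such an image can be sketched numerically. The real tool is written in R and renders an actual image; this hypothetical Python fragment only maps background-corrected log-ratios onto the physical spot grid and flags a row-wise gradient (names and threshold are illustrative):

```python
import numpy as np

def spatial_matrix(rows, cols, red, green):
    # arrange background-corrected log-ratios by physical spot position,
    # so the matrix mirrors the printed layout of the array
    M = np.full((max(rows) + 1, max(cols) + 1), np.nan)
    for r, c, R, G in zip(rows, cols, red, green):
        M[r, c] = np.log2(R / G)
    return M

def flag_spatial_trend(M, tol=0.5):
    # a print-run or hybridization gradient shows up as drifting row medians
    med = np.nanmedian(M, axis=1)
    spread = float(med.max() - med.min())
    return spread, spread > tol
```

Rendering `M` as an image (e.g., with R's `image()`) gives the bird's-eye view described above; the numeric summary simply automates the eyeball check.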
For the analysis of data stemming from our high-throughput genotyping experiments, we have developed a tool that automates the selection of SNPs for fine-mapping genetic associations. The tool generates a graph of genotypes from phased chromosomes, grouped by haplotype via a hierarchical clustering approach, to display long-range linkage disequilibrium patterns for a given allele of interest. We currently use phased chromosome data from the HapMap project and, among other things, highlight those SNPs included on the Affymetrix 100K SNP GeneChip. These graphs make it possible to identify the haplotypes on which an associated SNP occurs and to delineate the region likely to contain the causative variant for a given association.
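As an illustration of the grouping idea (not the tool's own code, which is written in R), one might collapse identical phased haplotypes into groups and order them by distance from the most common haplotype carrying the allele of interest; all names below are hypothetical:

```python
from collections import Counter

def group_haplotypes(haps, allele_snp, allele=1):
    # haps: phased chromosomes as 0/1 sequences, one per chromosome.
    # Collapse identical haplotypes and report which groups carry the
    # associated allele at the SNP of interest.
    counts = Counter(map(tuple, haps))
    groups = sorted(counts.items(), key=lambda kv: -kv[1])
    carriers = [h for h, _ in groups if h[allele_snp] == allele]
    return groups, carriers

def display_order(haps, allele_snp, allele=1):
    # order haplotype groups by Hamming distance from the most common
    # carrier haplotype, a crude stand-in for hierarchical clustering
    groups, carriers = group_haplotypes(haps, allele_snp, allele)
    if not carriers:
        return [h for h, _ in groups]
    ref = carriers[0]
    ham = lambda h: sum(a != b for a, b in zip(h, ref))
    return sorted((h for h, _ in groups), key=ham)
```

In the display, carrier haplotypes then sit together, and the boundaries where they diverge from non-carriers bracket the region likely to contain the causative variant.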
A separate module within HapMapper identifies SNPs that serve to distinguish haplotypes, as well as those in strong linkage disequilibrium with an associated allele, and those that are proxies for other SNPs in the region. These data are integrated into the visual display, aiding in the selection of SNPs for fine mapping haplotypes that contain the associated allele. The software is written in R and has been implemented for our use in fine-mapping several regions of interest.
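The proxy computation rests on the standard linkage disequilibrium statistic r². A small Python sketch, assuming phased 0/1 haplotypes (illustrative names, not the HapMapper implementation):

```python
import numpy as np

def r2(a, b):
    # LD r^2 between two biallelic SNPs, from phased 0/1 haplotypes:
    # r^2 = D^2 / (pA (1-pA) pB (1-pB)), with D = P(AB) - pA pB
    a, b = np.asarray(a, float), np.asarray(b, float)
    pa, pb = a.mean(), b.mean()
    D = (a * b).mean() - pa * pb
    return D * D / (pa * (1 - pa) * pb * (1 - pb))

def proxies(haps, snp, threshold=0.8):
    # SNPs whose r^2 with `snp` exceeds the threshold can stand in for it,
    # so only one member of each such set needs to be genotyped
    H = np.asarray(haps)
    return [j for j in range(H.shape[1])
            if j != snp and r2(H[:, snp], H[:, j]) >= threshold]
```

Dropping proxies from the candidate list, while keeping the SNPs that actually distinguish haplotypes, is what makes an economical fine-mapping panel possible.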