Tom Willis

The first rule of boat club is that you only ever talk about boat club.

I’m a final-year PhD student in the Wallace lab at the MRC Biostatistics Unit at the University of Cambridge, waiting for my viva in November ’24 and looking for work in the meantime. I did my undergraduate degree in CS and my Master’s in statistics, both at the University of Leeds. For my PhD I investigated the contribution of common variants to two rare antibody deficiencies, selective IgA deficiency and common variable immunodeficiency. I used a pleiotropy-informed method, the conditional false discovery rate, to overcome the problem of small sample sizes. I also spent a lot of time rowing for my college, Catz. You can find my CV (resume) here.

Common-variant analysis means GWAS, and I’ve worked on this methodology end to end:

imputation of genotypes from SNP microarray data with the Michigan Imputation Server
QC’ing microarray and WGS data and otherwise manipulating it with plink(2)
running GWAS with fastGWA-GLMM from the GCTA package
dealing with genotype and phenotype data from the UK Biobank
hit work-up using the excellent Open Targets Genetics API
heritability and genetic correlation estimation using sumher (part of LDAK) and ldsc
my own improvement of an existing method of detecting genetic similarity, the ‘GPS test’, which was published but did not prove too useful in my own work

I’ve also spent too much time visualising GWAS data on the genome-wide and locus-specific scales. I submit the occasional PR to locuszoomr, a very good R package for drawing such plots, and I also wrote a tutorial, now likely outdated, on using karyoploteR for the same end.

Earlier in my career I did a lot of bulk RNA-seq differential expression analysis: first in bacteria (published here) and viruses at Novartis in Emeryville, California as a bioinformatics intern, then in African cichlid fish (published here) when I was an Amgen Scholar at the Department of Genetics in Cambridge. As a Master’s student I also investigated zero inflation in scRNA-seq data, which, alas, did not yield anything substantial enough to publish.

Early on in my PhD I forked a tool for preprocessing diverse GWAS summary statistics files into a uniform format, GWAS_tools. The original was written by my colleague Dr Guillermo Reales; I developed my fork into a snakemake pipeline (see below) which I like rather more and which better meets my requirements for common-variant analysis. I’ve since spun out code from this pipeline for downloading and processing the 1000 Genomes Phase 3 data (both the older hg19 and the newer high-coverage hg38 versions) into its own workflow. I hope this will prove useful given the routine use of the 1kGP data in GWAS-related analyses.

I’m keen on reproducible bioinformatics and for the past few years I’ve worked almost exclusively in snakemake; some examples of the pipelines I’ve published alongside manuscripts are here and here. I work mostly in R and Python, but when the situation calls for it I’ve written performant code in C++, calling it in R thanks to the excellent Rcpp and arma packages (e.g. the vl_mode2_arma function in my fork of the cFDR package). I’ve also written standalone programs in C++. Having had some memorably bad experiences with software installation, I’ve taken the time to learn technologies like cmake, conda, docker, and singularity to make deployment as painless as possible for myself and my users. In maintaining my boat club’s website, I also learned some ‘recreational’ JavaScript in the guise of Vue 3. At Novartis I also spent a very long time refactoring a large bioinformatics application written in Java, using Gradle and JUnit along the way.

You can see my publication record on Google Scholar.