• Data Science Engineer


    Machine Learning, Fraud Detection, Lead Prioritization, Time Series Analysis.

    December 2015 - Present

  • Data Scientist

    Dow Jones

    Machine Learning, NLP, Visualization.

    November 2014 - December 2015

  • Fellow

    Insight Data Science

    An intensive seven-week postdoctoral training fellowship bridging the gap between academia and data science.

    July 2014 - September 2014

  • Postdoctoral Associate

    Weill Cornell Medical College

    Biostatistics, Inference of Time-Varying Networks.

    January 2012 - November 2014

  • Ph.D. in Biostatistics

    University of Bristol

    Machine Learning, Bayesian Statistics and way too much MCMC.

    October 2009 - January 2012

  • Masters in Complexity Sciences

    Bristol Centre for Complexity Sciences

    Interdisciplinary projects in Biostatistics.

    September 2008 - October 2009

  • Bachelors: Double Major in Mathematics & Physics

    University of Bristol

    Mathematics and Physics

    October 2004 - September 2008

Side Projects

In Press

As part of my current position as a Data Scientist at Dow Jones, I have the fantastic opportunity to collaborate regularly on newsroom projects with reporters and editors of the Wall Street Journal. Here are a few examples of published work.


In a previous life, I spent my time developing statistical tools and libraries to help address questions in the fields of Genetic Medicine and Protein Folding. During this time, I contributed to a number of manuscripts that have been published in peer-reviewed journals.

Smoking Dysregulates the Human Airway Basal Cell Transcriptome at COPD Risk Locus 19q13.2
Massive parallel RNA sequencing was used to compare the transcriptome of BC purified from the airway epithelium of healthy nonsmokers (n = 10) and healthy smokers (n = 7). The chromosomal location of the differentially expressed genes was compared to loci identified by GWAS to confer risk for COPD. Smoker BC have 676 genes differentially expressed compared to nonsmoker BC, dominated by smoking up-regulation. Strikingly, 166 (25%) of these genes are located on chromosome 19, with 13 localized to 19q13.2 (p < 10⁻⁴ compared to chance), including 4 genes (NFKBIB, LTBP4, EGLN2 and TGFB1) associated with risk for COPD. These observations provide the first direct connection between known genetic risks for smoking-related lung disease and airway BC, the population of lung cells that undergo the earliest changes associated with smoking.
Click to view paper

High Correlation of the Response of Upper and Lower Lobe Small Airway Epithelium to Smoking
The distribution of lung disease induced by inhaled cigarette smoke is complex, depending on many factors. With the knowledge that the small airway epithelium (SAE) is the earliest site of smoking-induced lung disease, and that the SAE gene expression is likely sensitive to inhaled cigarette smoke, we compared upper vs. lower lobe gene expression in the SAE within the same cigarette smokers to determine if the gene expression patterns were similar or different.
Click to view paper

LOGICOIL—multi-state prediction of coiled-coil oligomeric state
This work introduces LOGICOIL, the first algorithm to address the problem of predicting multiple coiled-coil oligomeric states from protein-sequence information alone. By covering >90% of the known coiled-coil structures, LOGICOIL is a net improvement compared with other existing methods, which achieve a predictive coverage of ∼31% of this population. This leap in predictive power offers better opportunities for genome-scale analysis, and analyses of coiled-coil containing protein assemblies.
Click to view paper

SCORER 2.0: an algorithm for distinguishing parallel dimeric and trimeric coiled-coil sequences
The predominant coiled-coil oligomer states in Nature are parallel dimers and trimers. Here, we improve and retrain the first-published algorithm, SCORER, that distinguishes these states, and test it against the current standard, MultiCoil. The SCORER algorithm has been revised in two key respects: first, the statistical basis for SCORER is improved markedly. Second, the training set for SCORER has been expanded and updated to include only structurally validated coiled coils.
Click to view paper

A basis set of de novo coiled-coil peptide oligomers for rational protein design and synthetic biology
Protein engineering, chemical biology, and synthetic biology would benefit from toolkits of peptide and protein components that could be exchanged reliably between systems while maintaining their structural and functional integrity. Ideally, such components should be highly defined and predictable in all respects of sequence, structure, stability, interactions, and function. To establish one such toolkit, we present a basis set of de novo designed α-helical coiled-coil peptides.
Click to view paper

The evolution and structure prediction of coiled coils across all genomes
Coiled coils are α-helical interactions found in many natural proteins. Various sequence-based coiled-coil predictors are available, but key issues remain: oligomeric state and protein–protein interface prediction and extension to all genomes. We present SpiriCoil (http://supfam.org/SUPERFAMILY/spiricoil), which is based on a novel approach to the coiled-coil prediction problem for coiled coils that fall into known superfamilies: hundreds of hidden Markov models representing coiled-coil-containing domain families.
Click to view paper

Prediction and analysis of higher-order coiled-coils: insights from proteins of the extracellular matrix, tenascins and thrombospondins
LOGICOIL outperformed other algorithms in predicting trimerisation of these proteins, and sequence analyses identified features associated with many other trimerising CCDs. The thrombospondins are a larger and more ancient family that includes sub-groups that assemble as trimers or pentamers. LOGICOIL predicted the pentamerising CCDs accurately. However, prediction of TSP trimerisation was relatively poor, although accuracy was improved by analysing only the central regions of the CCDs. Sequence clustering and phylogenetic analyses grouped the TSP CCDs into three clades comprising trimers and pentamers from vertebrates, and TSPs from invertebrates. Sequence analyses revealed distinctive, conserved features that distinguish trimerising and pentamerising CCDs. Together, these analyses provide insight into the specification of higher-order CCDs that should direct improved CCD predictions and future experimental investigations of sequence-to-structure functional relationships.
Click to view paper

Airway Basal Stem/Progenitor Cells Have Diminished Capacity to Regenerate Airway Epithelium in Chronic Obstructive Pulmonary Disease
Click to view paper

Designed coiled coils promote folding of a recombinant bacterial collagen
We propose that a single CC promotes folding of the CL domain via nucleation and in-register growth from one end, whereas initiation and growth from both ends in CC-CL-CC results in mismatched registers that frustrate folding. Bioinformatics analysis of natural collagens lends support to this because, where present, there is generally only one coiled-coil domain close to the triple helix, and it is nearly always N-terminal to the collagen repeat.
Click to view paper