Measuring biomolecules: an image processing and length estimation pipeline using atomic force microscopy to measure DNA and RNA with high precision
Candidate: Sundstrom, Andrew
Advisor: Mishra, Bud

Abstract

Background. An important problem in molecular biology is to determine the complete transcription profile of a single cell, a snapshot that shows which genes are being expressed and to what degree. Seen in series as a movie, these snapshots would give direct, specific observation of the cell.s regulation behavior. Taking a snapshot amounts to correctly classifying the cell.s ~300 000 mRNA molecules into ~30 000 species, and keeping accurate count of each species. The cell.s transcription profile may be affected by low abundances (1-5 copies) of certain mRNAs; thus, a sufficiently sensitive technique must be employed. A natural choice is to use atomic force microscopy (AFM) to perform single-molecule analysis. Reed et al. ("Single molecule transcription profiling with AFM", Nanotechnology, 18:4, 2007) developed such an analysis that classifies each mRNA by first multiply cleaving its corresponding synthesized cDNA with a restriction enzyme, then constructing its classification label from ratios of the lengths of its resulting fragments. Thus, they showed the transcription profiling problem reduces to making high-precision measurements of cDNA backbone lengths . correct to within 20-25 bp (6-7.5 nm).

Contribution. We developed an image processing and length estimation pipeline using AFM that can achieve these measurement tolerances. In particular, we developed a biased length estimator using James-Stein shrinkage on trained coefficients of a simple linear regression model, a formulation that subsumes the models we studied.

Methods. First, AFM images were processed to extract molecular objects, skeletonize them, select proper backbone objects from the skeletons, then compute initial lengths of the backbones. Second, a linear regression model was trained on a subset of molecules of known length, namely their computed image feature quantities. Third, the model.s coefficients underwent James-Stein shrinkage to create a biased estimator. Fourth, the trained and tuned model was applied to the image feature quantities computed for each test molecule, giving its final, corrected backbone length.

Results. Training data: one monodisperse set of cDNA molecules of theoretical length 75 nm. Test data: two monodisperse sets of cDNA molecules of unknown length. Corrected distributions of molecular backbone lengths were within 6-7.5 nm from the theoretical lengths of the unknowns, once revealed.

Conclusions. The results suggest our pipeline can be employed in the framework specified by Reed et al. to render single-molecule transcription profiles. The results reveal a high degree of systematic error in AFM measurements that suggests image processing alone is insufficient to achieve a much higher measurement accuracy.