DNA looks simple at first: it is just a sequence over the four letters C, G, A, and T. But the patterns in it encode proteins, RNA enzymes, and small coding RNAs, and those are just the ones we know about. DNA also comes from different places. Between 8% and 12% of the DNA in your genome, for example, was originally the genome of various viruses. Some of these viral genes have even been co-opted by our bodies. A viral gene that suppresses the immune response, for example, activates in the tissue of pregnant women, reducing their immune function and enabling the extra-long human gestation period.
Dr. Dan Ashlock works with students and collaborators at the University of Guelph to develop software constructs that can be trained, from examples, to spot DNA with different functions or origins. One of these devices, called a side effect machine, transforms DNA samples of varying length into a fixed-size numerical signature. Because every sample then has a signature of the same size, a variety of standard statistical or machine learning techniques can be applied to DNA data.
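To make the idea of a fixed-size signature concrete, here is a minimal Python sketch of one common way to describe a side effect machine: a finite state machine reads the DNA one base at a time, and a counter on each state records how often that state is entered. The three-state transition table, the function name sem_signature, and the choice to normalize counts by sequence length are all assumptions made for illustration. In the research described here the machines are trained from examples rather than written by hand, and this sketch does not handle ambiguous bases such as N.

```python
# Hypothetical 3-state transition table, invented for illustration only.
# TRANSITIONS[state][base] gives the next state.
TRANSITIONS = [
    {"C": 1, "G": 0, "A": 2, "T": 0},
    {"C": 1, "G": 2, "A": 0, "T": 1},
    {"C": 0, "G": 2, "A": 2, "T": 1},
]


def sem_signature(dna, transitions=TRANSITIONS, start=0):
    """Return one number per state: the fraction of steps that entered it."""
    counts = [0] * len(transitions)
    state = start
    for base in dna.upper():
        # A base outside C, G, A, T raises KeyError in this sketch.
        state = transitions[state][base]
        counts[state] += 1  # the "side effect" of entering a state
    total = len(dna)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


# Sequences of different lengths map to vectors of the same length.
print(sem_signature("ACGT"))
print(sem_signature("ACGTACGTTTGA"))
```

The signature has one entry per state no matter how long the input is, which is what lets ordinary classifiers and clustering methods work on it.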
Other DNA type-distinguishers developed at Guelph include woven string kernels and the do-what's-possible, on-demand string matching system. These techniques are blended with standard ones like spectrum kernels to provide a wide variety of tools for learning and distinguishing DNA types.
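Of the techniques named above, the spectrum kernel is the standard one and is simple to sketch: each sequence is represented by the counts of its length-k substrings (k-mers), and the similarity of two sequences is the inner product of those count vectors. The Python below is a minimal illustration under that textbook definition. The function names and the default k=3 are assumptions, and it does not attempt to show the woven string kernel or the do-what's-possible system.

```python
from collections import Counter


def kmer_counts(dna, k=3):
    """Count every overlapping substring of length k in the sequence."""
    dna = dna.upper()
    return Counter(dna[i:i + k] for i in range(len(dna) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of sequences a and b."""
    counts_a = kmer_counts(a, k)
    counts_b = kmer_counts(b, k)
    # Only k-mers present in both sequences contribute to the sum.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


print(spectrum_kernel("ACGT", "ACGT", k=2))  # shares AC, CG, GT once each
print(spectrum_kernel("ACGT", "TTTT", k=2))  # no 2-mers in common
```

Larger values mean the two sequences share more k-mers. A kernel of this form can be passed to any kernel-based learner, such as a support vector machine.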
This project is currently active and there is room for interested students in mathematics, statistics, or bioinformatics to join.