
DNA Outbreak Investigation Using Machine Learning



You are given a data set of DNA sequences, all of the same length (the file is available here). Each DNA sequence is a string over the alphabet 'A', 'C', 'T', 'G' and represents a particular viral strain sampled from an infected individual. Your goal is to write code that helps identify transmission clusters corresponding to outbreaks.


Treat each sequence as a feature vector and each character as a feature. The data set is stored as a FASTA file, which is essentially a text file of the following form:


>Name of Sequence1
AAGCACAGGATGTAATGGTGGGGCCGACCGCCTATTATTCTGATGATTACTTGAGGCCCTCGGAGAGGAAGGGG
>Name of Sequence2
AAGCACAGGATGTAATGGTGGGGCCGACCGCCTATTATTCTGATGATTACTTGAGGCCCTCGGAGAGGAAGGGG
>Name of Sequence3
AAGCACAGGATGTAATGGTGGGGCCGACCGCCTATTATTCTGATGATTACTTGAGGCCCTCGGAGAGGAAGGGG
...

Each line starting with the '>' symbol contains the name of a sequence; the sequence itself follows on the next line.
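As an illustration of the file layout described above, here is a minimal sketch of a FASTA reader in plain Python. The function names `parse_fasta` and `read_fasta` are hypothetical, and the sketch assumes sequence names are unique (a repeated name would overwrite the earlier entry). A library reader is equally acceptable for this step.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {name: sequence} dictionary."""
    sequences: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # Header line: everything after '>' is the sequence name.
            name = line[1:].strip()
            sequences[name] = ""
        elif name is not None:
            # Sequence line: append, so sequences wrapped over
            # several lines are also handled.
            sequences[name] += line.upper()
    return sequences


def read_fasta(path: str) -> Dict[str, str]:
    """Read a FASTA file from disk (the path is supplied by the caller)."""
    with open(path) as handle:
        return parse_fasta(handle)
```

Calling `read_fasta` with the path to the downloaded file would return a dictionary whose values are the equal-length strings used in the later steps.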


You may proceed as follows:

  • 1) Read the sequences from the file.

  • 2) Calculate the pairwise distances between sequences. Use the Hamming distance, i.e. the number of positions at which two sequences differ (see https://en.wikipedia.org/wiki/Hamming_distance).

  • 3) Project the sequences into 2-D space using Multidimensional Scaling (MDS) applied to the Hamming distance matrix.

  • 4) Plot the resulting 2-D data points. Estimate the number of clusters K by visual inspection.

  • 5) Use the k-means algorithm to cluster the 2-D data points into K clusters.
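Step 2 can be sketched as follows. This is one possible approach, assuming NumPy is available. The names `hamming_distance` and `hamming_matrix` are hypothetical helpers, not part of any library.

```python
import numpy as np


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def hamming_matrix(seqs):
    """Return the symmetric n x n matrix of pairwise Hamming distances."""
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences must have the same length")
    # One row per sequence, one column per position (single characters).
    arr = np.array([list(s) for s in seqs])
    n = len(seqs)
    dist = np.zeros((n, n), dtype=int)
    for i in range(n):
        # Compare sequence i with every sequence at once and count mismatches.
        dist[i] = (arr != arr[i]).sum(axis=1)
    return dist
```

The row-wise comparison avoids a Python-level double loop over all pairs, which can matter when the data set contains many sequences.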

You may use library functions to read the data from the file and to perform MDS. For multidimensional scaling in Python, see e.g. https://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html
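A minimal sketch of the MDS step with scikit-learn, assuming the Hamming distance matrix has already been computed. The wrapper name `project_2d` and its `seed` parameter are hypothetical. The `dissimilarity="precomputed"` argument tells `MDS` that the input is a distance matrix rather than raw features; recent scikit-learn releases may rename or deprecate this argument, so check the documentation linked above for the installed version.

```python
import numpy as np
from sklearn.manifold import MDS


def project_2d(dist, seed=0):
    """Embed a precomputed distance matrix into 2-D points with MDS."""
    mds = MDS(
        n_components=2,               # target dimensionality
        dissimilarity="precomputed",  # input is a distance matrix
        random_state=seed,            # fixed seed for reproducible layouts
    )
    return mds.fit_transform(np.asarray(dist, dtype=float))
```

The result is an array with one row per sequence and two columns, which can be passed directly to a scatter plot for the visual estimate of K.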


K-means clustering should be implemented from scratch. Your submission should contain:

  • The code of your script

  • Visualization plots for MDS with different clusters highlighted in different colors.
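For the from-scratch k-means requirement and the colored plots listed above, the following is a minimal sketch of Lloyd's algorithm, not a definitive implementation. It assumes NumPy for the arithmetic and matplotlib for plotting. The names `kmeans` and `plot_clusters` and the parameters `n_iter` and `seed` are hypothetical. Centroids are initialised from randomly chosen data points, which is only one of several possible choices.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Cluster points with Lloyd's algorithm; returns (labels, centroids)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance to every centroid.
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return labels, centroids


def plot_clusters(points, labels, centroids):
    """Scatter plot of the 2-D points, one color per cluster."""
    import matplotlib.pyplot as plt  # imported lazily; only needed here

    points = np.asarray(points)
    plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10")
    plt.scatter(centroids[:, 0], centroids[:, 1], c="black", marker="x")
    plt.xlabel("MDS dimension 1")
    plt.ylabel("MDS dimension 2")
    plt.title("k-means clusters on the MDS projection")
    plt.show()
```

Because the result depends on the random initialisation, running the sketch with a few different `seed` values and comparing the plots is a reasonable sanity check.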

Please do not hesitate to ask questions.



