Write a Java program that visually represents identified peptides, using the output of peptide identification software.
The program should read two file inputs from the user (using a File Chooser):
1. A peptide fasta file (that should be presented within the program GUI)
2. A .CSV file containing OMSSA output*
(*) The Open Mass Spectrometry Search Algorithm (OMSSA) is an efficient search engine for identifying MS/MS peptide spectra by searching libraries of known protein sequences. OMSSA scores significant hits with a probability score developed using classical hypothesis testing, the same statistical method used in BLAST. OMSSA is free and in the public domain. Detailed information can be found at http://pubchem.ncbi.nlm.nih.gov/omssa/.
3. Proteins should be parsed from the FASTA file, and your program should graphically represent the identified peptides as retrieved by OMSSA.
4. The program should also highlight the identified peptides as shown in the following figure
Figure 1 (solid blue indicates the observed portion of the protein): Protein X:
Figure 2 (Tip: You can use a JTextPane to show text in different colours, e.g. jTextPane1.setContentType("text/html");):
MRLAVGALLVCAVLGLCLAVPDKTVRWCAVSEHEATKCQSFRDHMKSVIPSDGPSVACVK
KASYLDCIRAIAANEADAVTLDAGLVYDAYLAPNNLKPVVAEFYGSKEDPQTFYYAVAVV
KKDSGFQMNQLRGKKSCHTGLGRSAGWNIPIGLLYCDLPEPRKPLEKAVANFFSGSCAPC
ADGTDFPQLCQLCPGCGCSTLNQYFGYSGAFKCLKDGAGDVAFVKHSTIFENLANKADRD
QYELLCLDNTRKPVDEYKDCHLAQVPSHTVVARSMGGKEDLIWELLNQAQEHFGKDKSKE
FQLFSSPHGKDLLFKDSAHGFLKVPPRMDAKMYLGYEYVTAIRNLREGTCPEAPTDECKP
VKWCALSHHERLKCDEWSVNSVGKIECVSAETTEDCIAKIMNGEADAMSLDGGFVYIAGK
CGLVPVLAENYNKSDNCEDTPEAGYFAIAVVKKSASDLTWDNLKGKKSCHTAVGRTAGWN
IPMGLLYNKINHCRFDEFFSEGCAPGSKKDSSLCKLCMGSGLNLCEPNNKEGYYGYTGAF
RCLVEKGDVAFVKHQTVPQNTGGKNPDPWAKNLNEKDYELLCLDGTRKPVEEYANCHLAR
APNHAVVTRKDKEACVHKILRQQQHLFGSNVTDCSGNFCLFRSETKDLLFRDDTVCLAKL
HDRNTYEKYLGEEYVKAVGNLRKCSTSSLLEACTFRRP
NB: These figures are made up examples – they don’t represent any real protein!
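One possible way to produce the coloured text for the JTextPane tip is to build an HTML string in which the identified regions are wrapped in a coloured tag. The sketch below is only an illustration: the class name PeptideHighlighter and its methods are hypothetical, and it assumes that Start/Stop are 1-based inclusive positions (adjust the index arithmetic if your files count from 0). Overlapping peptides are merged first so that no residue is wrapped twice.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of the assignment hand-out.
class PeptideHighlighter {

    /** Merges overlapping or adjacent {start, stop} ranges (1-based, inclusive: an assumption). */
    static List<int[]> mergeRanges(List<int[]> ranges) {
        List<int[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingInt((int[] r) -> r[0]));
        List<int[]> merged = new ArrayList<>();
        for (int[] r : sorted) {
            if (!merged.isEmpty() && r[0] <= merged.get(merged.size() - 1)[1] + 1) {
                // Extends the previous range instead of starting a new one.
                int[] last = merged.get(merged.size() - 1);
                last[1] = Math.max(last[1], r[1]);
            } else {
                merged.add(new int[] {r[0], r[1]});
            }
        }
        return merged;
    }

    /** Returns the sequence as HTML with every identified range shown in bold blue. */
    static String toHtml(String sequence, List<int[]> ranges) {
        StringBuilder html = new StringBuilder("<html><body><tt>");
        int pos = 0; // 0-based index of the next residue to emit
        for (int[] r : mergeRanges(ranges)) {
            int from = Math.max(r[0] - 1, pos);          // 1-based start to 0-based index
            int to = Math.min(r[1], sequence.length());  // exclusive end, clamped to the sequence
            if (from >= to) {
                continue; // range lies outside the sequence
            }
            html.append(sequence, pos, from);
            html.append("<font color=\"blue\"><b>")
                .append(sequence, from, to)
                .append("</b></font>");
            pos = to;
        }
        html.append(sequence.substring(pos));
        return html.append("</tt></body></html>").toString();
    }
}
```

The result can then be shown with jTextPane1.setContentType("text/html") followed by jTextPane1.setText(PeptideHighlighter.toHtml(sequence, ranges)).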
You are provided with:
• 2 OMSSA output files (.csv) that you can use to develop your program
• Some extra (large) protein files for further testing, in case you implement additional functions
At a minimum you would need to use the data from these columns:
• Start: The position in the protein where the peptide starts.
• Stop: The position in the protein where the peptide stops.
• Defline: This gives the name of the protein to which the peptide has been mapped - most importantly the accession number is given between the first two bars (|).
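As a rough sketch of how these three columns might be read, the code below locates Start, Stop and Defline by their header names and extracts the accession from between the first two bars. The class name OmssaCsv, the Hit type and the method names are hypothetical. It also assumes that the first line of the file is a header row and that fields containing commas (a Defline often does) are wrapped in double quotes; check this against the provided files.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the assignment hand-out.
class OmssaCsv {

    /** One identified peptide: protein accession plus Start/Stop exactly as given in the file. */
    static final class Hit {
        final String accession;
        final int start;
        final int stop;

        Hit(String accession, int start, int stop) {
            this.accession = accession;
            this.start = start;
            this.stop = stop;
        }
    }

    /** Splits one CSV line, keeping commas that appear inside double-quoted fields. */
    static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    cur.append('"'); // doubled quote inside a quoted field
                    i++;
                } else {
                    inQuotes = !inQuotes;
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }

    /** Returns the text between the first two '|' characters, or the whole defline if there are fewer than two. */
    static String accessionOf(String defline) {
        int first = defline.indexOf('|');
        int second = first < 0 ? -1 : defline.indexOf('|', first + 1);
        return second < 0 ? defline.trim() : defline.substring(first + 1, second).trim();
    }

    /** Parses header plus data lines into hits; column positions are looked up by name. */
    static List<Hit> parse(List<String> lines) {
        List<Hit> hits = new ArrayList<>();
        if (lines.isEmpty()) {
            return hits;
        }
        List<String> header = new ArrayList<>();
        for (String h : splitCsvLine(lines.get(0))) {
            header.add(h.trim());
        }
        int startCol = header.indexOf("Start");
        int stopCol = header.indexOf("Stop");
        int deflineCol = header.indexOf("Defline");
        if (startCol < 0 || stopCol < 0 || deflineCol < 0) {
            throw new IllegalArgumentException("Missing Start, Stop or Defline column");
        }
        for (int i = 1; i < lines.size(); i++) {
            if (lines.get(i).trim().isEmpty()) {
                continue; // skip blank lines
            }
            List<String> f = splitCsvLine(lines.get(i));
            hits.add(new Hit(accessionOf(f.get(deflineCol)),
                    Integer.parseInt(f.get(startCol).trim()),
                    Integer.parseInt(f.get(stopCol).trim())));
        }
        return hits;
    }
}
```

The lines can come from java.nio.file.Files.readAllLines applied to the file picked in the File Chooser. Looking columns up by name rather than by fixed position keeps the sketch working if the column order differs between files.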
There are lots of other refinements that could be done, such as:
• Allowing the user to choose from the proteins that exist in the file
• Annotating the output with data from the other columns, such as the p-value (an indication of how likely the identification is to be correct)
• Reading a FASTA file with multiple proteins and highlighting the corresponding peptides (see next screenshot)
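For the multi-protein refinement, one option is to read the FASTA file into a map keyed by accession, so that each peptide can be matched to its protein through the accession taken from the Defline column. This is a minimal sketch under stated assumptions: the class name FastaReader and its methods are hypothetical, and the FASTA header lines are assumed to carry the accession between the first two bars, with the first word of the header used as a fallback.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the assignment hand-out.
class FastaReader {

    /** Accession between the first two '|' characters, else the first word of the header. */
    static String keyOf(String header) {
        String h = header.trim();
        int first = h.indexOf('|');
        int second = first < 0 ? -1 : h.indexOf('|', first + 1);
        if (second >= 0) {
            return h.substring(first + 1, second).trim();
        }
        return h.split("\\s+")[0];
    }

    /** Maps each protein key to its full sequence, preserving the order of the file. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> proteins = new LinkedHashMap<>();
        String key = null;
        StringBuilder seq = new StringBuilder();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                if (key != null) {
                    proteins.put(key, seq.toString()); // store the previous record
                }
                key = keyOf(line.substring(1));
                seq.setLength(0);
            } else if (key != null) {
                // Sequence lines are concatenated with any inner whitespace removed.
                seq.append(line.replaceAll("\\s+", ""));
            }
        }
        if (key != null) {
            proteins.put(key, seq.toString()); // store the last record
        }
        return proteins;
    }
}
```

A LinkedHashMap is used so that a protein chooser in the GUI can list the proteins in file order.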