The following table gives the common offsets for each shared amino acid in sequence 1 and sequence 2. Note that when an amino acid appears multiple times, all combinations must be considered.
The lookup table indicates three common offsets with values 0, 3, and 5. Each common offset indicates a potential alignment, shown below:

sequence 1: ACNGTSCHQE
identities:  C   S  Q
sequence 2: GCHCLSAGQD   << offset = 0

sequence 1: ACNGTSCHQE---
identities:    G  C
sequence 2: ---GCHCLSAGQD   << offset = -3

sequence 1: ACNGTSCHQE-----
identities:       CH
sequence 2: -----GCHCLSAGQD   << offset = -5
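The lookup step can be sketched in a few lines of Python. This is only an illustration, not the FASTA program's own code: the function name `common_offsets` is hypothetical, positions are 1-based, and the offset is assumed to be the position in sequence 1 minus the position in sequence 2. The opposite sign convention gives the -3 and -5 labels used in the alignments above.

```python
from collections import defaultdict


def common_offsets(seq1, seq2):
    """Map each offset (pos in seq1 - pos in seq2) to its list of identities.

    Hypothetical helper for illustration; positions are 1-based.
    """
    # Lookup table: residue -> every position where it occurs in seq2.
    positions = defaultdict(list)
    for j, residue in enumerate(seq2, start=1):
        positions[residue].append(j)

    # A residue that occurs several times contributes every combination.
    offsets = defaultdict(list)
    for i, residue in enumerate(seq1, start=1):
        for j in positions.get(residue, []):
            offsets[i - j].append((i, j, residue))
    return dict(offsets)


hits = common_offsets("ACNGTSCHQE", "GCHCLSAGQD")
# Offsets shared by at least two identities are the "common" offsets.
shared = {off: found for off, found in hits.items() if len(found) > 1}
for off in sorted(shared):
    print(off, [residue for _, _, residue in shared[off]])
```

Under these assumptions the offsets shared by more than one residue are 0 (C, S, Q), 3 (G, C), and 5 (C, H), matching the three alignments shown.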
The next step in the FASTA algorithm is to locate the offsets whose matching sequence positions are separated by no more than a maximum distance (32 for 1-mers, 16 for 2-mers). All three offsets in this example are therefore kept for further analysis, since the identities on each are separated by fewer than 32 characters. The regions with the highest density of identities are then identified (offset = 0 has the highest, with three identities) and assigned a score (called INIT1) using a scoring matrix chosen by the user (e.g., BLOSUM62).
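A simplified sketch of the distance filter and density ranking for 1-mers follows. It is an assumption-laden reading of the step above, not the actual FASTA implementation: the names `diagonal_hits` and `rank_offsets` are hypothetical, an offset is kept only if every pair of consecutive identities on it lies within the maximum distance, and "density" is taken to be the plain identity count. Scoring with a substitution matrix such as BLOSUM62 is left out.

```python
from collections import defaultdict


def diagonal_hits(seq1, seq2):
    """Return {offset: [seq1 positions of identities]} for 1-mers."""
    hits = defaultdict(list)
    for i, a in enumerate(seq1):
        for j, b in enumerate(seq2):
            if a == b:
                hits[i - j].append(i)  # positions arrive in increasing order
    return hits


def rank_offsets(seq1, seq2, max_distance=32):
    """Keep offsets whose identities are close together; rank by count.

    max_distance=32 is the 1-mer limit quoted in the text. Using the raw
    identity count as the density measure is a simplifying assumption.
    """
    ranked = []
    for offset, positions in diagonal_hits(seq1, seq2).items():
        if len(positions) < 2:
            continue  # a lone identity is not a common offset
        gaps = [b - a for a, b in zip(positions, positions[1:])]
        if max(gaps) <= max_distance:
            ranked.append((offset, len(positions)))
    # Most identities first; ties broken by the smaller offset.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


ranking = rank_offsets("ACNGTSCHQE", "GCHCLSAGQD")
print(ranking)
```

For the example pair, all three offsets pass the filter and offset 0 ranks first with three identities.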
Our example ends here; however, further analysis is done on longer sequences that may have multiple common offsets separated by gaps. Multiple INIT1 regions in the same sequence pair are summed and penalized for gaps to produce a new score (called INITN). The sequences with the highest INITN scores are assigned "optimized" (OPT) scores by performing a Smith-Waterman local alignment in the INITN region. The OPT and INITN scores are used to rank the matches between the query sequence and the database sequences. Finally, as part of the output process, a full Smith-Waterman local alignment is performed on the query and database sequence, and the score is reported.
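For reference, the Smith-Waterman recurrence behind the final score can be sketched as below. This is a minimal illustration and not how FASTA itself computes OPT: the function name is hypothetical, the match/mismatch/gap values are placeholder assumptions standing in for a real substitution matrix such as BLOSUM62, gaps use a simple linear penalty, and the whole matrix is filled rather than only the region around INITN.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty).

    The scoring values are illustrative assumptions, not BLOSUM62.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACNGTSCHQE", "GCHCLSAGQD"))
```

Only the score is returned here; recovering the alignment itself would require keeping the full matrix and tracing back from the best cell.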