Concatenating sequences of several markers into one alignment


It is possible to create a phylogenetic tree from a single marker, but to use a larger dataset the sequences of several markers have to be concatenated.
This can cause problems, especially when not every organism has a sequence available for every marker. The program SequenceMatrix helps with this; to use it, Java has to be installed on your computer.

SequenceMatrix is a program that concatenates different sequences and shows which markers are missing for which strains. The missing data (nucleotides) are replaced with the '?' character.
In sequences of uneven length, the leading and trailing gaps are also replaced with '?' and treated as missing data. Note: real gaps inside the sequences are not replaced, because they are needed to keep the alignment correct.
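
The difference between external and internal gaps can be illustrated with a few lines of Python. This is only a sketch of the idea, not SequenceMatrix's own code; the function name mask_external_gaps is made up for this example, and it assumes that gaps are written as '-'.

```python
def mask_external_gaps(seq: str, gap: str = "-", missing: str = "?") -> str:
    """Replace leading and trailing gap characters with the missing-data symbol.

    Gaps inside the sequence are left untouched, because they belong to the
    alignment. (Hypothetical helper, written for illustration only.)
    """
    core = seq.strip(gap)
    if not core:
        # The sequence consists of gaps only: all of it is missing data.
        return missing * len(seq)
    lead = len(seq) - len(seq.lstrip(gap))    # number of leading gaps
    trail = len(seq) - len(seq.rstrip(gap))   # number of trailing gaps
    return missing * lead + core + missing * trail


print(mask_external_gaps("--ACG-T--"))  # ??ACG-T??
```

The internal gap between G and T stays a '-', while the two gaps at each end become '?'.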

Suppose there are three FASTA files containing ITS (internal transcribed spacer), EF1 (elongation factor 1) and BT2 (beta-tubulin) sequences.
To concatenate these sequences, all entries of the same strain must have exactly the same name or ID in every file.

A strain like KD35_v2 should have exactly this name in every file. When using a FASTA-formatted file:
>KD35_v2, >KD35 v2 and >KD35v2 are considered different entries by SequenceMatrix; in the end they will not be concatenated, but imported as separate entries.
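
Such naming differences are easy to overlook, so it can help to check the files before importing them. The sketch below is one possible way to do that in Python; it is not part of SequenceMatrix. The function names and the small in-line FASTA strings are invented for the example, and it assumes that the full header line after '>' is used as the strain name.

```python
import re
from collections import defaultdict


def fasta_names(text: str) -> list[str]:
    """Return the header line (without '>') of every entry in a FASTA text."""
    return [line[1:].strip() for line in text.splitlines() if line.startswith(">")]


def suspicious_names(names_per_marker: dict[str, list[str]]) -> list[set[str]]:
    """Group names that differ only in spaces, underscores, hyphens or case."""
    groups = defaultdict(set)
    for names in names_per_marker.values():
        for name in names:
            # Normalised key: separators removed, upper case.
            key = re.sub(r"[\s_\-]+", "", name).upper()
            groups[key].add(name)
    # Only groups with more than one spelling are a potential problem.
    return [variants for variants in groups.values() if len(variants) > 1]


# Hypothetical file contents; in practice read them from the FASTA files.
its = ">KD35_v2\nACGT\n>CBS114393\nACGT\n"
bt2 = ">KD35 v2\nACGT\n>CBS 114393\nACGT\n"

clashes = suspicious_names({"ITS": fasta_names(its), "BT2": fasta_names(bt2)})
for variants in clashes:
    print("Check these names:", sorted(variants))
```

Any group that is printed contains spellings that probably refer to the same strain and should be made identical before the import.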

In this example 5 strains are concatenated: CBS114392, CBS114393, CBS114394, CBS114395 and CBS114396.

Start SequenceMatrix

Import the sequence files one by one: Import - Add sequences

A popup window will ask whether all 'external' gaps should be replaced by a question mark: choose Yes to all
This is only asked at the first import.

remove all external gaps

Three sequence files have been imported, as shown in the table.
The naming convention is important: here a strain shows 'No data' although a sequence should be present.
In the beta-tubulin file CBS114393 is written as CBS 114393, which causes this problem. To fix it, delete all columns (entries), correct the name of the strain in the file and start importing again.

import with error

If all is well, every strain should have one entry with the 3 sequences selected.
If the naming convention is followed and there are still entries with (No data), the strain in question is missing that sequence in the data set.
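
What the concatenation amounts to can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the way SequenceMatrix is implemented: the function concatenate and the short example sequences are invented, every marker is assumed to be aligned already (all sequences of one marker have the same length), and a strain without a sequence for a marker receives a block of '?' of that marker's length.

```python
def concatenate(markers: dict[str, dict[str, str]], missing: str = "?") -> dict[str, str]:
    """Join per-marker alignments by strain name, padding absent markers.

    `markers` maps a marker name to a {strain name: aligned sequence} dict.
    (Hypothetical helper, written for illustration only.)
    """
    lengths = {}
    for marker, seqs in markers.items():
        seq_lengths = {len(s) for s in seqs.values()}
        if len(seq_lengths) != 1:
            raise ValueError(f"{marker}: sequences are not aligned to one length")
        lengths[marker] = seq_lengths.pop()

    # Every strain that occurs in at least one marker gets one row.
    strains = sorted({name for seqs in markers.values() for name in seqs})
    return {
        strain: "".join(
            seqs.get(strain, missing * lengths[marker])
            for marker, seqs in markers.items()
        )
        for strain in strains
    }


# Made-up, very short sequences; CBS114393 has no EF1 sequence.
example = {
    "ITS": {"CBS114392": "ACGT", "CBS114393": "AC-T"},
    "EF1": {"CBS114392": "GGA"},
    "BT2": {"CBS114392": "TT", "CBS114393": "TA"},
}
for strain, sequence in concatenate(example).items():
    print(strain, sequence)
# CBS114392 ACGTGGATT
# CBS114393 AC-T???TA
```

Because strains are matched purely by name, a misspelled name would simply appear as an extra row filled mostly with '?'.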

SM 3files good


Now the sequences can be exported as a concatenated data set.
SequenceMatrix exports the file in 'nexus' format. The best option is the one marked ("naked", e.g. for GARLI); after that, the content can easily be adapted or converted to another format. For programs that use the nexus format, like MrBayes, RAxML or PAUP, this conversion is not necessary, but extra options can be added. A conversion program that converts the exported nexus file to FASTA format can be found

SM export
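
For readers who prefer to script the conversion to FASTA themselves, a minimal Python sketch is given here. It is not the conversion program mentioned above, and the exact layout of the exported file is an assumption: the sketch expects a non-interleaved nexus file in which the word MATRIX stands on its own line, every following line holds one name and one sequence separated by whitespace, and the matrix ends with a line starting with ';'. The function name and the example text are invented.

```python
def nexus_matrix_to_fasta(nexus_text: str, width: int = 60) -> str:
    """Convert the MATRIX block of a simple, non-interleaved nexus text to FASTA.

    Assumes one 'name sequence' pair per line. (Hypothetical helper.)
    """
    records = []
    in_matrix = False
    for raw in nexus_text.splitlines():
        line = raw.strip()
        if not in_matrix:
            # Skip everything up to and including the MATRIX keyword.
            in_matrix = line.upper() == "MATRIX"
            continue
        if line.startswith(";"):
            break  # end of the matrix
        if not line:
            continue
        parts = line.rsplit(None, 1)  # the last token is the sequence
        if len(parts) != 2:
            raise ValueError(f"Cannot split name and sequence: {line!r}")
        name, sequence = parts
        records.append((name.strip("'\""), sequence))

    out = []
    for name, sequence in records:
        out.append(f">{name}")
        # Wrap the sequence at `width` characters per line.
        out.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(out) + "\n"


# Made-up example of a small nexus file.
nexus = """#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=2 NCHAR=9;
FORMAT DATATYPE=DNA GAP=- MISSING=?;
MATRIX
CBS114392 ACGTGGATT
CBS114393 AC-T???TA
;
END;
"""
print(nexus_matrix_to_fasta(nexus), end="")
```

If the exported file is laid out differently (for example interleaved), the parsing part has to be adapted.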

To try it yourself, download the example files.


If you use this program in a published article, please use this citation:
Vaidya, G., D. J. Lohman, R. Meier. SequenceMatrix: concatenation software for the fast assembly of multigene datasets with character set and codon information. Cladistics, accepted.
Accessible at: http://dx.doi.org/10.1111/j.1096-0031.2010.00329.x