This course teaches you to work with different kinds of alignment programs and tree construction software.
A lot of software is available, both free and paid, as well as web portals that do the job.
Web portals, however, often limit the size of the data to prevent users from uploading gigabytes of data.
Your own computer, depending on its hardware (CPU, available RAM etc.), can handle large datasets.
PhyloMain v6, which is available via this website, uses freely available software to do the job.
Much more software on this topic is available on the internet and this website.
Updated version: 2020-11-15
If you already downloaded the package then use the next download to update the PhyloMain program:
Create a folder, e.g. D:\PhyloMain. Move the zip file into this folder and unzip it here. A folder PhyloMain is created that contains the starting program and the subfolders Examples, Manual and Progs.
Move the subfolder Progs to C:\Users\Public (in Dutch: C:\Gebruikers\Openbaar).
In D:\PhyloMain start the program by double-clicking PhyloMain.exe. A newer version, which offers some benefits, is available further down this webpage.
When everything is correct, you will see this window appearing on your monitor.
Comments and questions can be addressed via contact form.
In case you didn't move the 'Progs' folder to C:\Users\Public, you have to assign every single program to the new directory.
E.g. if the 'Progs' folder is still in the installation folder then 'progsdir' should become D:\PhyloMain\Progs
Double-click on the specific program and choose the right folder and program; some are .exe files, others .bat
Look at the examples in the window:
bioedit = C:\Users\Public\Progs\BioEdit\BioEdit.exe and should be changed into D:\PhyloMain\Progs\BioEdit\BioEdit.exe
mafft = C:\Users\Public\Progs\mafft-win\mafft.bat should be changed into D:\PhyloMain\Progs\mafft-win\mafft.bat
and so on.
What is phylogeny? These are definitions found in dictionaries:
1. The evolutionary development and history of a species or trait of a species or of a higher taxonomic grouping of organisms
2. A model or diagram delineating such an evolutionary history
3. A similar model or diagram delineating the development of a cultural feature
4. the development or evolution of a particular group of organisms
5. the sequence of events involved in the evolutionary development of species or taxonomic group of organisms
Phonetics: faɪˈlɒdʒɪnɪ
A phylogeny is a hypothetical relationship between groups of organisms being compared. A phylogeny is often depicted using a phylogenetic tree, describing the evolutionary relationships between organisms.
It will take some days to fully explain and understand phylogeny, but there are some important terms you need to know if you want to read phylogenetic trees:
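synapomorphy: shared derived trait or structure inherited from the most recent common ancestor of a group
monophyly: group of organisms (clade) consisting of a most recent common ancestor and all of its descendants
paraphyly: group of organisms that includes a most recent common ancestor but not all of its descendants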
homology: trait or structure that was present in the common ancestor and persisted in the evolved lineages, but might differ in form or function
homoplasy: a shared trait or structure between two or more organisms, that did not evolve from a common ancestor, the opposite of homology
convergent evolution: process whereby species that are not closely related, independently evolve functionally or visually similar structures or traits
divergent evolution: process whereby related organisms, in response to abiotic or biotic factors, develop differences in homologous traits or structures, which may lead to speciation
Common abiotic factors that might influence speciation are:
atmosphere, chemical elements, sunlight/temperature, wind and water, but also nutrient enrichment and nonliving things like rocks, mountains, air and soil.
Biotic factors are based on living things, like organic matter, ratio of predators and prey, population size, presence of phyto- and zooplankton etc.
To be prepared, it is advisable to visit this website:
https://biologydictionary.net/phylogeny/
In the installation directory (D:\PhyloMain) the 'Examples' folder contains a few example files:
ALL_Dermas.fas, Clades_align_nodup_wsl.fas and LSU33.fas.
During this course we will work with these example files.
These files are prepared for the course. If you use your own fasta files always be sure the file contains NO blank lines.
As you can see, we are going to use files in fasta format mainly. Fasta formatted files are the standard files nowadays.
There are still phylogenetic programs that don't use fasta formatted files. MrBayes and PAUP use nexus formatted files; RAxML traditionally reads PHYLIP formatted files.
Nexus formatted files consist of blocks of information that contain strain information, parameters, calculation setup etc.
The first line of every nexus file is #NEXUS.
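To give an idea of what such a block looks like, here is a minimal Python sketch that writes an aligned set of sequences as a NEXUS DATA block. The function name and the two example strains are made up for this illustration, and real programs such as MrBayes or PAUP usually need extra blocks with their own settings.

```python
def fasta_to_nexus(records):
    """Return a minimal NEXUS DATA block for aligned (name, sequence) pairs."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    nchar = lengths.pop()
    width = max(len(name) for name, _ in records)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(records)} NCHAR={nchar};",
        "  FORMAT DATATYPE=DNA MISSING=? GAP=-;",
        "  MATRIX",
    ]
    for name, seq in records:
        # one taxon per line: padded name, then the aligned sequence
        lines.append(f"  {name.ljust(width)}  {seq}")
    lines += ["  ;", "END;"]
    return "\n".join(lines)


# hypothetical two-strain alignment
print(fasta_to_nexus([("CBS_1", "ACGT-A"), ("CBS_2", "ACGTTA")]))
```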
Here are the first 3 sequences of the file:
>CBS12996
TCAGTAACGGCGAAGTGGAAGCGGCAACAGCTCAAATTTGAAATTCTGAGCCTCTTCGGGGTCCGAGTTGTAATTTGGAG
AGGATGTTTCGGGCACGGTCCGGGCTTATATTTCTTGGAACAGAATGTCATAGAGGGTGAGAATCCCGTCTGAGAGCTCG
GACACGACCTATGTGAAACTCCTTCGACGAGTCGAGTTGTTTGGGAATGCAGCTCTAAATGGGTGAGTAAATTTCATCTA
AAGCTAAATATTGGCCAGAGACCGATAGCGCACAAGTAGAGTGATCGAAAGATGAAAAGCACTTTGAAAAGAGAGTTAAA
AAGTACGTGAAATTGTTGAAAGGGAAGCGCTGGCGACCAGACTTGCGCGTCGGGGTTCCCCCTTGCTTCTGCTTGGGTTA
CTCCCCGGCGTTCAGGCCAACATCGGTTTCGGGGGTTGGTTAAAG
>CBS11385
TCAGTAACGGCGAGTGAAGCGGCAACAGCTCAAATTTGAAATCTGGCCTCTGCGGGGTCCGAGTTGTAATTTGGAGAGGA
TGTTTCGGGCACGGTCCGGGCTTAAATTTCTTGGAACAGAATGTCACAGAGGGTGAGAATCCCGTCTGGAGTCCGGACAC
GGCCCATGTGAAACTCCTTCGACGAGTCGAGTTGTTTGGGAATGCAGCTCTAAATGGGTGGTAAATTTCATCTAAAGCTA
AATACTGGCCAGAGACCGATAGCGCACAAGTAGAGTGATCGAAAGATGAAAAGCACTTTGAAAAGAGAGTTAAACAGTAC
GTGAAATTGTTGAAAGGGAAGCGCTGTCAACCAGACTTGCGCGTCGGGGTTCCCCCTTGCTTCTGCCTGGGTTACTCCCC
GGCGTTCAGGCCAACATCGGTTTCGGGGGTTGGTTAAAG
>CBS98596
CTAGTAACGGCGAGTGAAGCGGCAAGAGCTCAAATTTGAAATCTGGCTCTTTCAGAGTCCGAGTTGTAATTTGTAGAGGA
TGTTTCGGACACGACCCCGGTTTAAATTTCTTGGAACAGAATGTCAAAGAGGGTGAGAACCCCGTCTTGAGCCGGCGGTA
CGGTCTATGTGAAACTCCTTCGACGAGTCGAGTTGTTTGGGAATGCAGCTCAAAATGGGTGGTAAATTTCATCTAAAGCT
AAATATTGGCCAGAGACCGATAGCGCACAAGTAGAGTGATCGAAAGATGAAAAGCACTTTGAAAAGAGAGTTAAACAGTA
TGTGAAATTGTTGAAAGGGAAGCGCTTGCAACCAGACTTGAGCGCGGTGGTTCCCCCTTCCTTCTGGTTGGGCTATTCCA
CCGTGTCCAGGCCAACATCAGTTTTGGCGGCCGGTTAAAG
It is saved in fasta format. Every sequence starts with a '>' followed by the ID of the strain, with the sequence itself on the following line(s).
The '>' line can contain more information.
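The fasta layout is simple enough to read with a few lines of Python. The sketch below is only an illustration (the function name is an assumption, not part of PhyloMain): it joins the wrapped sequence lines of each entry and skips blank lines.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from fasta formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


print(parse_fasta(">CBS1\nACGT\nTTAA\n>CBS2\nGGCC\n"))
```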
Open file Clades_align_nodup_wsl.fas in the same way with Notepad.
Right-click on Clades_align_nodup_wsl.fas and select 'Open with' and choose Notepad.
>1|CBS11463|Spathasporaarborariae|Debaryomycetaceae_Lodderomyces_Scheffersomyces
CCGTGGTAATTCTAGAGCTAATACATGCTTAAAACCCCGACTGTTTGGAAGGGGTGTATTTATT
AGATAAAAAATCAAGATGATTCATAATAACTTTTCGAATCGCATGGCCTTGTGCTGGCGATGGTTCATTCA
AATTTCTGCCCTATCAACTTTCGATGGTAGGATAGTGGCCTACCATGGTTTCAACGGGTAACGGGGAATAAGGGTTC
GATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCAATCCC
GATACGGGGAGGTAGTGACAATAAATAACGATACAGGGCCCTTTTGGGTCTTGTAATTGGAATGAGTACAATG
TAAATACCTTAACGAGGAACAATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAAAAGCGT
ATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAACCTTGGGGGTCCGCCTTTCTGGCGAGTACTT
TCTTCTGACTTTTACTTTGAAAAAATTAGAGTGTTCAAAGCAGGCCTTTGCTCGAATATATTAGCATGG
AATAATAGAATAGGACGTTATGGTTCTATTTTGTTGGTTTCTAGGACCATCGTAATGATTAATAG
GGACGGTCGGGGGTATCAGTATTCAGTTGTCAGAGGTGAAATTCTTGGATTTACTGAAGACTAACTACTGCG
AAAGCATTTCCAAGGACGTTTTCATTAATCAAGAACGAAAGTTAGGGGATCGAAGATGATCAGATACC
GTCGTAGTCTTAACCATAAACTATGCCGACTAGGGATCGGGCTGCGCAATCGGCACCTTACGAGAAATCAAAG
TCTTTGGGTTCTGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGGAATTGA
CGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTCACCAGGTC
CAGACACAATAAGGATTGACAGATTGAGAGCTCTTTCTTGATTTTGTGGGTGGTGGTGCATGGCCGTTCTTAG
TTGGTGGAGTGATTTGTCTGCTTAATTGCGATAACGAACGAGACCTTAACCTGCTAAATAGTGCTGCTAGC
TTTTGCTGCTTCTTAGAGGGACTATTCAAGTCGATGGAAGTTTGAGGCAATAACAGGTCTG
TGATGCCCTTAGACGTTCTGGGCCGCACGCGCGCTACACTGACGGAGCCAGCGAGTATAAACCTTGGCCGAG
AGGTCTGGGAAATCTTGTGAAACTCCGTCGTGCTGGGGATAGAGCATTGCAATTATTGCTCTTCAACGAGGAATTC
CTAGTAAGCGCAAGTCATCAGCTTGCGTTGATTACGTCCCTGCCCTTTGTACACACCGCCCGTCGCTACTACCGATTGAA
TGGCTTAGTGAGGCCTCCGGAGGCAACAAGCTGGTCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACA
AGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTAAAAACTTTCAACAACGGATCTCTTGGTT
CTCGCATCGATGAAGAACGCAGCGAAATGCGATAGTAATTGAATTGCAGATATTCGTGAATCATCGAATCTTT
GAACGCACATTGCGCCCTCTGGTATTCCGGCATGCCTGTTTGAGCGTTGAACCTCAAATC
AGGTAGGACTACCCGCTGAACTTAAGCATATCAATAAGCGGAGGAAAA
GAAACCAACAGGGATTGCCTTAGTAGCGGCGAGTGAAGCGGCAAAAGCTCAAATTTGAAAGA
GTTGTAATTTGAAGAAGGTATCTTTGGGTCTGGCGCTTG
TCTATGTTCCTTGGAACAGGACGTCACAGAGGGTGAGAATCCCGTCGAAGAGTCGAGTTGTTTGGGAATGCAGCTCT
AAGTGGGTGGTAAATTCCATCTAAAGCTAAATATTGGCGAGAGACCGATAGCGAACAAGTACAGTGATGGAAAGATGA
AAAGAACTTTGAAAAGAGAGTGAAAAAGTACGTGAAATTGTTGAAAGGGAAGGGTGCGATCAGACGGCCAGCATCGGT
TCGGGTGGCAGGATAATTGCGGAGAAATGTGGCGTGTTATAGTCTCTGT
CGATACTGCCTGCCTGGACCGAGGACTGCAGGATGCTGGCATAATGATCGTAAGCCGCCCGTCTTGA
AACACGGACCAAGGAGTCTAACGTCTATGCGAGTGTTTGGGTGTTAAACCCGTACGCGTAATGAAAGT
GAACGTAGGTGAGGTGCATCATCGACCGATCCTGATGTCTTCGGATGGATTTGAGTAAGAGCAT
AGCTGTTGGGACCCGAAAGATGGTGAACTATGCCTGAATAGGGTGAAGCCAGAGGAAACTCTGGTG
GAGGCTCGTAGCGG
It contains much longer sequences. The '>' line shows information separated with the piping symbol | .
This symbol comes in handy when we want to select certain information.
Open file ALL_Dermas.fas with Notepad.
Right-click on ALL_Dermas.fas and select 'Open with' and choose Notepad.
There are different ways to add sequences to an existing file.
1. Copy & paste from a file
2. Use a program, like BioEdit and import sequences
3. Collect sequences from Genbank
We are going to collect sequences from Genbank. In PhyloMain click on 'Genbank search'.
Parameters:
nucleotides
Organism: Trichophyton violaceum
Product: internal transcribed spacer 1
Sequence length between 300 and 600 bp
Save output as: D:\PhyloMain\Examples\gb_export.fas
Max. items to export: 5
Press 'Start search'
If you open gb_export.fas with Notepad, you will see that the '>' line contains much more information than the lines in 'ALL_Dermas.fas'.
It is up to us whether we keep it this way or edit these lines to match 'ALL_Dermas.fas'. Here we reduce each line to the strain number and replace the space with an underscore:
CBS 730.88 -> CBS_730.88
CBS 305.60 -> CBS_305.60
IHEM 1279 -> IHEM_1279
IHEM 19041 -> IHEM_19041
IHEM 4711 -> IHEM_4711
Suppose we also want to keep the Genbank accession number as well. We could separate this number and the strain number with the piping character.
The lines would look like this (make these changes in the gb_export.fas):
>MK806614.1|CBS_730.88
>MK806613.1|CBS_305.60
>MK806612.1|IHEM_1279
>MK806611.1|IHEM_19041
>MK806610.1|IHEM_4711
Now we have a problem, because the original file 'ALL_Dermas.fas' doesn't have this information. For normal use this will not cause any problems, but if we want to use the piping character to select certain information, e.g. only strain number, it will cause an error. To avoid the problem we can add the piping character to all 237 strains in 'ALL_Dermas.fas'.
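The same convention can be written down as a small Python helper. This is only a sketch with a made-up function name: it replaces spaces in the strain number with underscores and always writes the piping character, so that every '>' line has the same two fields even when the accession number is unknown.

```python
def make_header(accession, strain):
    """Build a '>' line of the form >accession|strain_number."""
    strain_id = "_".join(strain.split())  # 'CBS 305.60' -> 'CBS_305.60'
    return ">" + accession + "|" + strain_id


print(make_header("MK806613.1", "CBS 305.60"))  # with Genbank accession number
print(make_header("", "IHEM 1279"))             # accession number unknown
```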
Select the Notepad window with the opened 'ALL_Dermas.fas' file.
Ctrl-H: search for > and replace with >|
Copy the contents of 'gb_export.fas' and paste it at the end of 'ALL_Dermas.fas'.
Save As: 'ALL_Dermas_gb.fas'
Close 'gb_export.fas' (we don't need this anymore)
FOR MOST COMPUTER WORK CONSISTENCY IS THE KEY.
Search (Ctrl-F) 305.60
>|CBS_305.60
Press F3 (Find next)
>MK806613.1|CBS_305.60
The strain occurs two times, but for most programs these are different entries, because the '>' line is different. We can decide later to keep both entries.
However, if two entries of the same strain have the same '>' line, this will cause an error during alignment or tree construction.
If, for example, both lines read '>|CBS_305.60', we have to delete one of them or change one of them into e.g. '>|CBS_305.60B'.
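In a large file it is easier to let a script look for identical '>' lines. A minimal sketch (the function name is an assumption for this example):

```python
from collections import Counter


def duplicate_headers(fasta_text):
    """Return the '>' lines that occur more than once."""
    headers = [line.strip() for line in fasta_text.splitlines()
               if line.startswith(">")]
    counts = Counter(headers)
    return sorted(h for h, n in counts.items() if n > 1)


text = ">|CBS_305.60\nACGT\n>MK806613.1|CBS_305.60\nACGA\n>|CBS_305.60\nACGT\n"
print(duplicate_headers(text))
```

Only exact duplicates are reported; '>|CBS_305.60' and '>MK806613.1|CBS_305.60' count as different entries, just as they do for the alignment programs.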
Open Notepad and type the following numbers on separate lines
MN737950.1
MN737949.1
MN737948.1
MN737947.1
MN737946.1
Save the file as: gb_accn.txt (in D:\PhyloMain\Examples)
In the PhyloMain 'Genbank search' window click 'Reset values'.
Open accn-file: browse for gb_accn.txt
Click 'Start file contents'
When the sequences are collected it will show an information window with the name of the saved file (results.fas) and directory.
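For comparison, the same list of accession numbers can be requested directly from the NCBI E-utilities 'efetch' service. How PhyloMain collects the sequences internally is not known to us, so this is only a sketch of one possible approach; the function names are made up.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions):
    """Build an efetch URL that returns the records in fasta format."""
    params = {
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(accessions):
    """Download the records (needs an internet connection)."""
    with urlopen(efetch_url(accessions)) as response:
        return response.read().decode("utf-8")


print(efetch_url(["MN737950.1", "MN737949.1"]))
```

Only the URL is printed here; calling fetch_fasta with the five numbers from gb_accn.txt would download the sequences themselves.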
Open 'BioEdit'. If you moved the Progs folder to C:\Users\Public, you have to open this folder:
C:\Users\Public\Progs\BioEdit and start the program by double-clicking 'BioEdit.exe'.
Open 'ALL_Dermas.fas' (Ctrl-O)
From the menu: File - Import - Sequence alignment file
Select 'gb_export.fas'
Save the new file as 'ALL_Dermas_gb.fas' (this file already exists, so overwrite it or choose a different name)
Exercise
Do you remember that we looked at D:\PhyloMain\Examples\Clades_align_nodup_wsl.fas?
This file contains '>' lines with the piping character '|'.
Create a new fasta file with only the strain number and genus/species name for all Candida species. Try this before looking at the solution.
Choose 'Export selection' in PhyloMain
Open Fasta file: D:\PhyloMain\Examples\Clades_align_nodup_wsl.fas
Save Selection As: (change default value) D:\PhyloMain\Examples\Clades_Candida.fas
Check 'Reduce strain info' and select 'Split character' (piping symbol |)
Check the boxes for strain number and species name (2nd & 3rd checkbox)
Export sequences (search): Candida
Click 'Export selection'
Result: 'Total number of sequences exported: 499'
When you check the checkbox "Only '>' lines" and click 'Show' you will see non-Candida strains at the end of the list.
Why is this?
The search looks for any occurrence of 'Candida' in the '>' line. Apparently these entries also contain 'Candida' somewhere in the line, probably as a synonym.
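The difference between searching the whole '>' line and searching one field can be shown with a short Python sketch. Both headers below are hypothetical and only mimic the layout of the example file; the function names are assumptions as well.

```python
def matches_anywhere(header, term):
    """True if the term occurs anywhere in the '>' line."""
    return term.lower() in header.lower()


def matches_field(header, term, field_index, sep="|"):
    """True if the chosen field (0-based) starts with the term."""
    fields = header.lstrip(">").split(sep)
    if field_index >= len(fields):
        return False
    return fields[field_index].lower().startswith(term.lower())


# hypothetical headers: number | strain | species | extra information
h1 = ">7|CBS0001|Candidaalbicans|Debaryomycetaceae"
h2 = ">8|CBS0002|Pichiakudriavzevii|Pichiaceae_Candidakrusei"

for h in (h1, h2):
    print(h, matches_anywhere(h, "Candida"), matches_field(h, "Candida", 2))
```

The second header is found by the search on the whole line, but not when only the species field is checked.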
Alignment is the starting point for creating any cladogram. Many alignment programs are available, for nucleotide as well as amino acid sequences. But how reliable are these programs in aligning the dataset the right way? And what is the right way? Most programs will have no problem aligning sequences that look very similar. But what happens if the sequences are variable, even at species level? With alignments of 20 or 50 strains it is still possible to edit the alignment manually. However, the number of strains used in phylogeny keeps growing, as does the length of the sequences, and editing 1000 strains with sequences of 3000 bp becomes very time consuming.
We have to rely on the quality of the alignment.
In PhyloMain three well-known alignment programs can be selected: Muscle, MAFFT and ClustalW. In general, Muscle and ClustalW are used for fairly similar sequences, although they can handle sequences with higher variability. MAFFT has more options to choose from, such as local or global alignment. Which program should you use? That is difficult to say. If you understand the contents of your dataset, you may be able to pick the right software. Otherwise it is a matter of trial and error.
In PhyloMain select 'Start alignment'
Muscle is selected as default. Some parameters can be set, like iterations and -diags. For small datasets that contain similar sequences, 8 iterations should suffice; otherwise set it to 12 or 16. Muscle stops iterating as soon as the alignment no longer improves. When sequences are very similar, e.g. several sequences of the same species with low variability, checking -diags speeds up the alignment: Muscle then uses short matching segments (diagonals) between sequences to avoid unnecessary comparisons.
MAFFT has more options. FFT-NS-i is selected as default. This protocol will try to find the best possible parameters for aligning based on the dataset. FFT-NS-x creates 1 or 2 trees and uses them as a reference to establish an improved alignment. The other options depend on the dataset and can be used if we know the contents of our dataset. The following schemes illustrate the differences.
global alignment (G-INS-i) - all residues can be aligned over the total length.
XXXXXXXXXXX-XXXXXXXXXXXXXXX
XX-XXXXXXXXXXXXXXX-XXXXXXXX
XXXXX----XXXXXXXX---XXXXXXX
XXXXX-XXXXXXXXXX----XXXXXXX
XXXXXXXXXXXXXXXX----XXXXXXX
local alignment (L-INS-i) - the dataset contains one alignable domain flanked by non-alignable residues.
ooooooooooooooooooooooooooooooooXXXXXXXXXXX-XXXXXXXXXXXXXXX------------------
--------------------------------XX-XXXXXXXXXXXXXXX-XXXXXXXXooooooooooo-------
------------------ooooooooooooooXXXXX----XXXXXXXX---XXXXXXXooooooooooo-------
--------ooooooooooooooooooooooooXXXXX-XXXXXXXXXX----XXXXXXXoooooooooooooooooo
--------------------------------XXXXXXXXXXXXXXXX----XXXXXXX------------------
affine gaps (E-INS-i) - the dataset contains alignable and non-alignable domains, where the non-alignable domains are not being aligned.
oooooooooXXX------XXXXooooooooooo---------------oooooooXXXXXooooooooooooooooo--ooooooooooooooo
---------XXXXX----XXXXoooooooooooooooooooooooooooooooooXXXXX-ooooooooooooooooooooooo----------
oooooooo-XXXXX----XXXX---------------------------------XXXXX---oooooooooo--oooooooooooo-------
---------XXXXXX---XXXX---------------------------------XXXXX----------------------------------
---------XXXXXXXXXXXXX---------------------------------XXXXX----------------------------------
---------XX-------XXXX---------------------------------XXXXX----------------------------------
'Adjust direction' is checked. This ensures that all sequences have the same direction as the first sequence in the file. A sequence with the wrong direction is replaced by its reverse complement, and the ID of the strain is changed by adding '_R_' after the '>', e.g. >_R_CBS305.60
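What changing the direction of a sequence means can be shown in a few lines of Python. This is a sketch of the reverse complement only, not of the way MAFFT decides which sequences need it; the function name is an assumption.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement; gaps and N are left unchanged."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))
```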
ClustalW uses iterations to establish a good alignment; mostly 3-5 iterations are used. It is possible to use an NJ reference tree with bootstrapping, which naturally increases calculation time.
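Behind the buttons, alignment programs are command-line tools. The sketch below builds command lines for MAFFT and for MUSCLE version 3 with the options discussed above. The flags are the ones documented for these programs, but the exact commands PhyloMain runs are not known to us, and the function names and the program names 'mafft' and 'muscle' are assumptions (on Windows these may be mafft.bat and muscle.exe).

```python
MAFFT_STRATEGIES = {
    "FFT-NS-i": ["--retree", "2", "--maxiterate", "1000"],
    "G-INS-i": ["--globalpair", "--maxiterate", "1000"],
    "L-INS-i": ["--localpair", "--maxiterate", "1000"],
    "E-INS-i": ["--ep", "0", "--genafpair", "--maxiterate", "1000"],
}


def mafft_command(fasta_in, strategy="FFT-NS-i", adjust_direction=True):
    """Command line for MAFFT; the alignment is written to standard output."""
    cmd = ["mafft"] + MAFFT_STRATEGIES[strategy]
    if adjust_direction:
        cmd.append("--adjustdirection")
    return cmd + [fasta_in]


def muscle_command(fasta_in, aligned_out, iterations=8, diags=False):
    """Command line for MUSCLE v3 (later versions use different options)."""
    cmd = ["muscle", "-in", fasta_in, "-out", aligned_out,
           "-maxiters", str(iterations)]
    if diags:
        cmd.append("-diags")
    return cmd


print(" ".join(mafft_command("ALL_Dermas_gb.fas", "L-INS-i")))
print(" ".join(muscle_command("LSU33.fas", "LSU33_muscle.afa", 8, True)))
```

Because MAFFT prints the alignment to standard output, the result has to be redirected to a file when the command is run by hand.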
Select option 'Muscle' if it is not selected
Iterations 8
Because the sequences are very similar (LSU is quite conserved), we can check -diags
Open Fasta file: D:\PhyloMain\Examples\LSU33.fas
Save Alignment As: D:\PhyloMain\Examples\LSU33_muscle.afa
It is good practice to keep the same filename and add the alignment method, and to change the extension to .afa to distinguish unaligned (.fas) from aligned (.afa) fasta files.
Don't change anything else and 'Start alignment'.
If you want to open the file after alignment is done, check 'Open alignment with BioEdit'.
Select option 'MAFFT' if it is not selected
FFT-NS-i
Open Fasta file: D:\PhyloMain\Examples\ALL_Dermas_gb.fas
Save Alignment As: D:\PhyloMain\Examples\ALL_Dermas_gb_mafft.afa
Check 'Open alignment with BioEdit' and 'Start alignment'.
When you look through the alignment you will see aligned domains with difficult or non-aligned regions. Close the BioEdit program. Repeat the alignment with the following parameters:
L-INS-i
Save Alignment As: D:\PhyloMain\Examples\ALL_Dermas_gb_mafft_L.afa
'Start alignment'
Do you think this is a better alignment?
When you compare the two MAFFT alignment methods, you will see that the alignment results are influenced by the method chosen. In my opinion the alignment with option L-INS-i looks better.
Let's have a closer look.
In BioEdit select 'Edit - Search - Find in titles' and type CBS_117.61
When you scroll to the end of the alignment you will see that ITS2 region is missing. CBS_374.92 has a similar problem where ITS1 region is missing.
Some tree construction programs treat these gaps as a fifth character state next to A, T, C and G; others ignore them entirely, which means that large parts of the alignment are not used for tree construction at all.
Delete these sequences: select the sequence and Ctrl-Delete to delete it from the alignment. Save the file.
Furthermore, we still have two sequences for CBS_305.60, although one has also a Genbank accession number. This one is shorter and can be deleted; search for MK806613 and delete the sequence.
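Sequences with large missing regions, like the two deleted above, can also be found with a small script instead of scrolling through the alignment. This is a sketch with made-up names and an arbitrary threshold of 30% gaps.

```python
def gap_fraction(seq):
    """Fraction of positions in an aligned sequence that are gaps."""
    return seq.count("-") / len(seq) if seq else 0.0


def flag_gappy(records, threshold=0.3):
    """Return the names of sequences with more gaps than the threshold."""
    return [name for name, seq in records if gap_fraction(seq) > threshold]


# hypothetical aligned sequences
records = [("strain_A", "ACGTACGT"), ("strain_B", "ACGT----")]
print(flag_gappy(records))
```

A suitable threshold depends on the dataset, so the flagged sequences should still be checked by eye in BioEdit.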
The alignment now contains 239 sequences (237 original strains plus 5 from Genbank, minus the 3 we deleted). Because most of the sequences lack a Genbank number, we are going to create a new file without these numbers using the piping character.
Go to PhyloMain and select 'Export selection'
Open Fasta file: D:\PhyloMain\Examples\ALL_Dermas_gb_mafft_L.afa
Save Selection As: (change default name) D:\PhyloMain\Examples\ALL_Dermas_gb_mafft_L_CBS.afa
Check 'Reduce strain info' and select 'Split character' the piping character | (our data is separated with this character)
The first sequence of the file is used to obtain the information separated by the character. The first sequence doesn't contain a Genbank number and therefore the checkbox marker is empty.
Check the box with the CBS number (only this field will be exported).
Click 'Export selection'
The 'Total number of sequences exported' should show 239.
The sequence alignment now contains only lines with a strain number without Genbank numbers, e.g. >CBS_305.60 and without piping characters.
But the number doesn't say anything about the species we are dealing with. While 'Exporting selection' can reduce strain information 'Change strain label' can add information.
Choose 'Change strain label' in PhyloMain
As you can see, information can be added to a fasta file, but also to a tree file. We are going to use a fasta file.
It also needs an Excel file with the information we want to add.
Of the five strains we added to our alignment, information is missing for three: the Genbank copy of CBS_305.60 was deleted and that strain is already present in the Excel file, as is CBS_730.88.
Open D:\PhyloMain\Examples\All_Dermas_Data.xlsx in Excel
Go to the end of the list and add the following lines in column ID and Genus respectively:
IHEM_1279 Trichophyton violaceum
IHEM_19041 Trichophyton violaceum
IHEM_4711 Trichophyton violaceum
Save and close the Excel file.
Open Fasta file: D:\PhyloMain\Examples\ALL_Dermas_gb_mafft_L_CBS.afa
Open Excel file: D:\PhyloMain\Examples\All_Dermas_Data.xlsx
Save converted file: D:\PhyloMain\Examples\ALL_Dermas_gb_mafft_L_CBS_convert.fas (default name)
Genus/species name will be added to our sequences as default. 'Add extra info to labels from the Excel file' can be used to add the information from column 3 and further in Excel to the fasta file. In general this is used for tree files.
You have the option to open the converted file by checking the box.
Click 'Start conversion'
When finished it will show the location of the created file.
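The idea behind the conversion can be sketched in Python with a plain dictionary in place of the Excel file. The label layout used here (strain number, underscore, species name) and the function name are assumptions; the labels PhyloMain writes may look different.

```python
def relabel(fasta_text, names):
    """Append the species name to every '>' line whose strain is known."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            strain = line[1:].strip()
            species = names.get(strain)
            if species:
                line = ">" + strain + "_" + species.replace(" ", "_")
        out.append(line)
    return "\n".join(out)


names = {"IHEM_1279": "Trichophyton violaceum"}
print(relabel(">IHEM_1279\nACGT\n>CBS_0000\nGGCC", names))
```

Strains that are missing from the table keep their original label, which makes them easy to spot afterwards.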
Gblocks eliminates poorly aligned positions and divergent regions of a DNA or protein alignment so that it becomes more suitable for phylogenetic analysis.
Gblocks can be started while creating an alignment, by checking the GBlocks checkbox, or afterwards on an existing alignment via 'GBlocks' in PhyloMain.
Good parameter values are difficult to guess; there are stringent and less stringent parameter values.
'Maximum Number of Contiguous Nonconserved Positions': the higher the number, the more positions will be added to the final alignment (less stringent).
'Minimum Length of a Block': the higher the number, the fewer positions will be added (more stringent).
'Allowed Gap Positions': there are three values, None, With half, All (how gaps are treated, more->less stringency).
None: all sites that contain gaps will be eliminated.
With half: sites that contain >50% gaps will be eliminated.
All: sites with gaps are allowed and treated as a separate value.
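The effect of the three gap settings can be illustrated with a simplified Python sketch. It only applies the gap rule as described above; the real Gblocks program also evaluates conservation and block length, so this is not a reimplementation. The function name and the example alignment are made up.

```python
def keep_columns(aligned, mode="None"):
    """Return the 0-based columns that survive the gap rule.

    mode is 'None', 'With half' or 'All', as in the list above.
    """
    n = len(aligned)
    kept = []
    for i in range(len(aligned[0])):
        gaps = sum(1 for seq in aligned if seq[i] == "-")
        if mode == "None" and gaps > 0:
            continue  # any gap removes the site
        if mode == "With half" and gaps > n / 2:
            continue  # more than 50% gaps removes the site
        kept.append(i)
    return kept


aligned = ["AC-T", "A--T", "AG-T", "ACGT"]
for mode in ("None", "With half", "All"):
    print(mode, keep_columns(aligned, mode))
```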
Unfortunately there is a downside to this program: GBlocks CAN NOT handle '>' lines larger than 50 characters.
Select GBlocks in PhyloMain
Open Fasta alignment: D:\PhyloMain\Examples\ALL_Dermas_gb_mafft_L_CBS.afa
Use the default parameter values: 8, 10, None
Click 'Start Gblocks' and 'Open Gblocks alignment' when finished
The Gblocks alignment is saved with the original name + .gb --> D:\PhyloMain\Examples\ALL_Dermas_gb_mafft_L_CBS.afa.gb
The length of the alignment is reduced to 293 sites and contains no gaps.
The same directory also contains an html file, that shows the sites that are used and eliminated to create this alignment.
You can open the file in the default browser by double-clicking the file D:\PhyloMain\Examples\ALL_Dermas_gb_mafft_L_CBS.afa.htm
Play around with the parameters and see what happens to the alignment results.
Finally, we created the best possible alignment. This aligned file can be used to create cladograms.
As with alignment, for which numerous programs exist, there are also numerous programs available to create cladograms or phylogenetic trees.
The choice is difficult here too. Which software should we use to obtain a tree from our dataset?
This also depends on the contents of our dataset. If the alignment contains very similar sequences, there is no point in choosing a very complex tree construction program.
It is like using a cannon to kill a fly: time consuming and not very efficient. With today's relatively fast computers, however, speed is not really an issue anymore.
How to choose the right program? What do you want to achieve? A guideline for which program to use is shown in this picture.
Maximum parsimony: MEGA and PAUP are two software packages that are able to create these trees.
Distance methods: most tree constructing software is able to create trees based on distance matrices, MEGA, PAUP, FastTree, Phylip.
Maximum likelihood: these trees are made by RAxML and IQ-tree; MrBayes uses Bayesian inference, which is also based on likelihood.
In PhyloMain we can use FastTree, IQ-tree, RAxML, PAUP and MrBayes to construct phylogenetic trees.
Creating maximum likelihood or parsimony trees is much slower than creating trees based on a distance matrix, like neighbor joining trees.
Distance methods are based on differences between the different sequences. All sequences are compared with each other resulting in a matrix.
  | A | B  | C  | D
A | - | 17 | 21 | 27
B |   | -  | 12 | 18
C |   |    | -  | 14
D |   |    |    | -
Based on these distances a tree can be created.
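How such a matrix is filled can be shown with a short Python sketch that counts the differences between every pair of aligned sequences. The three sequences are made up and the function names are assumptions; real programs usually correct these raw counts with a substitution model.

```python
def differences(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def distance_matrix(seqs):
    """Return {(name1, name2): differences} for every pair of sequences."""
    names = list(seqs)
    return {
        (p, q): differences(seqs[p], seqs[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }


seqs = {"A": "ACGTAC", "B": "ACGTTC", "C": "TCGTTA"}
for pair, d in distance_matrix(seqs).items():
    print(pair, d)
```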
Maximum parsimony tries to find, among all possible trees, the tree that requires the fewest 'mutation events'.
It uses only informative sites in an alignment, i.e. sites with at least two different nucleotides; when there are only two, each of them should occur in at least two sequences. (The stricter textbook definition always requires at least two different nucleotides that each occur in at least two sequences.)
Seq1 ..AGCTAAAGGGTCAGGGGAAGGGCA..
Seq2 ..AGCAAAAGGGTCAGGGGAAGGGGA..
Seq3 ..AGCATAGGGGTCAGGGGAAAGGCT..
Seq4 ..AGCGGACCGGTAAGGAGAAAGGAC..
info      ** *            *  **
With this method it is possible to obtain more than one tree with the fewest 'mutation events'; which of these is the best tree cannot be decided. In general a consensus tree is created from all 'best' trees, usually containing at least one polytomy, i.e. a node that is not bifurcating. But an incompletely resolved tree is better than an incorrect tree.
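Which sites count as informative can be checked with a few lines of Python. This sketch applies two rules to Seq1 to Seq4 above: the rule as worded in this course (three or more different nucleotides, or two nucleotides that each occur at least twice) and the stricter textbook rule (at least two nucleotides that each occur in at least two sequences). The function names are made up for this illustration.

```python
from collections import Counter

SEQS = [
    "AGCTAAAGGGTCAGGGGAAGGGCA",
    "AGCAAAAGGGTCAGGGGAAGGGGA",
    "AGCATAGGGGTCAGGGGAAAGGCT",
    "AGCGGACCGGTAAGGAGAAAGGAC",
]


def informative_course(column):
    """Rule as worded in this course."""
    counts = Counter(column)
    if len(counts) >= 3:
        return True
    return len(counts) == 2 and min(counts.values()) >= 2


def informative_textbook(column):
    """At least two nucleotides that each occur in two or more sequences."""
    counts = Counter(column)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def sites(seqs, rule):
    """1-based positions of the columns accepted by the rule."""
    return [i + 1 for i, col in enumerate(zip(*seqs)) if rule(col)]


print(sites(SEQS, informative_course))
print(sites(SEQS, informative_textbook))
```

For these four sequences the course rule marks positions 4, 5, 7, 20, 23 and 24, while the textbook rule keeps only position 20.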
Maximum likelihood uses all available sites and many parameters based on a substitution model. Without a proper substitution model an ML tree cannot be created. Parameters that are used are e.g. tree topology and branch lengths, the substitution rates between nucleotide pairs (GC, GA, GT, AC, AT, TC) and the frequencies of A, C, G and T.
Let's create a tree with the LSU33_muscle.afa alignment, that we created in the alignment course.
The alignment shows very similar sequences with conserved regions. We use 'FastTree', because IQ-tree, RAxML or MrBayes will not give a better tree and will take more computer time. FastTree was created to handle very large datasets.
Select 'FastTree (NJ)' in PhyloMain
Open Fasta file: D:\PhyloMain\Examples\LSU33_muscle.afa (it must be an alignment or it will throw an error)
Save FastTree As: default value; this is the name of the alignment file, but with extension .tre
If you want the tree to be opened, check box 'View tree in FigTree'
Click 'Start FastTree'
If 'View tree in FigTree' was checked, an information window will open. The tree file contains values for the nodes/branches and the default name is 'label'. You can change this into 'support' for example, but this is not really necessary.
If it wasn't checked, you can open the tree by selecting 'Open existing tree' in PhyloMain.
Select 'Open existing tree' in PhyloMain
The filebrowser window will open; choose directory and file D:\PhyloMain\Examples\LSU33_muscle.tre
The tree will open in FigTree; click OK to use the default 'label' name
Press 'Ctrl-MU' to view the tree in a balanced representation
Check 'Branch label' on the left side menu and open this menu by clicking the little triangle shape
At 'Display:' choose 'label' (default is 'Branch times')
At 'Format:' choose 'Percent' (default is 'Decimal')
The tree only shows CBS numbers and no information about genus/species.
We are going to add information to the branches with data that is stored in an Excel file: LSU33_Data.xlsx
Select 'Change strain label' in PhyloMain
Open Tree file: D:\PhyloMain\Examples\LSU33_muscle.tre
Open Excel file: D:\PhyloMain\Examples\LSU33_Data.xlsx
Save converted file: use default value LSU33_muscle_convert.nwk
The Excel file contains 5 columns: 2 mandatory columns with StrainID and Genus and 3 optional columns.
Click on the combobox to see what these other columns are. Here you can choose what extra information you want to add to the tree labels. The order in which you choose the information will also be the order in which it is added to the label. When you choose more than one, you can choose the separator character that is placed between the items. The default value is '/', but you can also change this into a comma or semi-colon etc.
Select 'Source' then 'Country'
As separator character type ;
Check 'Open converted tree file' to view the tree in FigTree
Click 'Start conversion'
Resulting tree
How to create such a datafile? Although it goes beyond the scope of this course, the following exercise uses BioEdit and Excel to create such a file relatively fast. I assume that you already have a list of the strains you are working with, whether in Word or Excel.
In Excel you can manipulate data in many ways, but that is a totally different course.
Open the data-template in Excel: D:\PhyloMain\Examples\Data_template.xlsx
The 'green' columns are mandatory data, while the 'yellow' columns can be any data you want to add; the headings of these columns can be adapted to your needs.
Open D:\PhyloMain\Examples\LSU33.fas in BioEdit (the program is located in C:\Users\Public\Progs\BioEdit)
Press Ctrl-A to select all sequences
Edit - Copy sequence titles (or Shift+Ctrl+C)
Go to the opened Excel file and click in cell A2, then Ctrl-V to paste the clipboard contents
Suppose the sequence file contains '>' lines with more information separated by e.g. the piping character |
This information can easily be transferred to Excel.
Open D:\PhyloMain\Examples\Clades_align_nodup_wsl.fas in BioEdit
Repeat the same procedure: Ctrl-A, Shift-Ctrl-C
Open Notepad and paste the contents, save file as e.g. titles.txt
In Excel menu: File - Open
Select the saved textfile 'titles.txt'
To import choose Next, Delimited, Next, Other and type | , Finish
Now you can add headers and columns or delete columns you don't need.
A more complex tree construction program is IQ-tree. This program has many more parameter settings and creates maximum likelihood trees. According to the flowchart already mentioned, this program is often used with a dataset that contains sequences with unrecognizable similarities.
Start IQ-tree by selecting 'Create IQ-tree' in PhyloMain.
This is the IQ-tree window:
Model criterion
IQ-tree creates maximum likelihood trees and therefore a substitution model is necessary. You can select your own substitution model by checking the box and choosing the model. Otherwise the program calculates the best possible model based on the dataset; BIC (Bayesian Information Criterion) is the default criterion and tends to select the simplest model. You can change this to another criterion: AIC (Akaike Information Criterion) or AICc (corrected Akaike Information Criterion).
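The three criteria are simple formulas that weigh the log-likelihood of a model against its number of free parameters k and the number of sites n; the model with the lowest value wins. A small Python sketch with made-up numbers shows why BIC tends to prefer simpler models: its penalty per parameter is ln(n), which is larger than the 2 used by AIC as soon as the alignment has more than 7 sites.

```python
import math


def aic(log_l, k):
    """Akaike Information Criterion."""
    return 2 * k - 2 * log_l


def aicc(log_l, k, n):
    """AIC with a correction for small sample sizes."""
    return aic(log_l, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(log_l, k, n):
    """Bayesian Information Criterion."""
    return k * math.log(n) - 2 * log_l


# hypothetical model: log-likelihood -1000, 10 parameters, 500 sites
print(aic(-1000.0, 10), aicc(-1000.0, 10, 500), bic(-1000.0, 10, 500))
```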
IQ-tree parameters
Max. iterations doesn't have to be changed; IQ-tree seldom uses all 1000 iterations to create a tree. The other options produce support values for the branches in the tree.
The 'old' Nonparam BS (non-parametric bootstrapping), which is also available in other programs, is slower than the Bootstraps option developed for IQ-tree (called UF-boot); the two cannot be used at the same time. Bootstrapping can be combined with aLRT iterations and Bayesian inference. When all three are used, all three values are shown in the tree. UF-boot needs a value >=1000, while non-parametric BS can use fewer iterations: 100, but preferably 500.
Note: with non-parametric BS, values >70% are generally taken as good support for a branch. Due to the algorithm of UF-boot, a well-supported branch should have a value of at least 95%.
A 'partition file' is only necessary for concatenated sequences of different markers, because different substitution models can then be assigned to the different markers in the alignment. For a single-marker alignment we don't have to create a partition file.
The rest of the parameters can be used to do extra tests or when a dataset contains a lot of shorter sequences. 'Prevent overestimating branch support' can be checked if there is a possibility that the dataset violates the assumptions of the chosen substitution model, i.e. the model parameters don't fully agree with the dataset, but there is no better model.
Don't change any of the 'Tree building parameters'
Open Fasta file: D:\PhyloMain\Examples\LSU33_muscle.afa (this must be an aligned fasta file, otherwise it will throw an error)
Save IQ-tree As: default value (it will use the filename, but adds _IQ to the name to distinguish it from other tree files)
If you check 'View tree in FigTree' it will open the tree after construction.
Click 'Build tree'
Open also the tree created with FastTree: D:\PhyloMain\Examples\LSU33_muscle.tre
Which tree is better, i.e. more resolved?
Unlike the FastTree tree, the IQ-tree does not show branch support. Create a tree with IQ-tree with branch support.
Change the name of the saved file to LSU33_muscle_support_IQ to create a new tree
Put check marks on aLRT iterations, Bootstraps and Bayesian inference; keep the default values
Click 'Build tree'
The labels in the tree are shown in aLRT/BI/BS order; values like 100/1/100 represent a reliable branch that is well supported by all three methods.
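Reading such combined labels can be sketched in a few lines of Python. The 95% threshold for UF-boot comes from the note above; the thresholds of 80 for aLRT and 0.95 for Bayesian inference are assumptions used here for illustration only, and the function names are hypothetical.

```python
def parse_support(label):
    """Split an 'aLRT/BI/BS' node label into three numbers."""
    alrt, bi, bs = (float(value) for value in label.split("/"))
    return {"aLRT": alrt, "BI": bi, "BS": bs}


def is_well_supported(label, min_alrt=80.0, min_bi=0.95, min_bs=95.0):
    """True if all three values reach their (assumed) thresholds."""
    support = parse_support(label)
    return (support["aLRT"] >= min_alrt
            and support["BI"] >= min_bi
            and support["BS"] >= min_bs)


print(is_well_supported("100/1/100"))     # all three methods agree
print(is_well_supported("85.2/0.97/90"))  # UF-boot below 95
```

Adjust the thresholds to whatever criteria you decide to apply in your own study.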
RaxML can also handle very large datasets, but is very computer-intensive. The larger the dataset, the more power, memory and time are needed to build a phylogenetic tree.
Furthermore, RaxML is very sensitive when it comes to dataset formats. It uses relaxed interleaved or sequential Phylip format, or fasta format. The dataset has to be examined and edited to avoid the following problems:
1. Identical sequence name(s) appearing multiple times in an alignment. This can easily happen when you export a standard PHYLIP file from a tool that truncates the sequence names to 8 or 10 characters.
2. Identical sequence(s) that have different names but are exactly identical. This mostly happens when you have excluded some hard-to-align regions from your alignment; keeping such duplicates does not make sense.
3. Undetermined column(s) that contain only ambiguous characters and will be treated as missing data, i.e. columns that entirely consist of X, ?, * for AA data and N, O, X, ? for DNA data.
4. Undetermined sequence(s) that contain only ambiguous characters (see above) and will be treated as missing data.
5. Prohibited character(s) in taxon names: names that contain any form of whitespace (blanks, tabs, carriage returns) or one of the following prohibited characters: colon, semicolon, () or [].
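The checks in this list can be automated before starting RaxML. Below is a minimal sketch for an aligned DNA dataset, assuming the sequences are already read into (name, sequence) pairs; the function name is hypothetical and the set of undetermined characters is the DNA set given above.

```python
from collections import Counter

DNA_UNDETERMINED = set("NOX?")   # characters treated as missing data
PROHIBITED = set(":;()[]")       # not allowed in taxon names


def check_alignment(records):
    """Return a list of problems found in (name, sequence) records."""
    problems = []
    # 1. Names that occur more than once.
    for name, count in Counter(name for name, _ in records).items():
        if count > 1:
            problems.append(f"duplicate name: {name}")
    first_with_sequence = {}
    for name, seq in records:
        key = seq.upper()
        # 2. Exactly identical sequences under different entries.
        if key in first_with_sequence:
            problems.append(
                f"identical sequences: {first_with_sequence[key]} and {name}")
        else:
            first_with_sequence[key] = name
        # 4. Sequences that consist of undetermined characters only.
        if set(key) <= DNA_UNDETERMINED:
            problems.append(f"undetermined sequence: {name}")
        # 5. Whitespace or prohibited characters in the name.
        if any(ch.isspace() or ch in PROHIBITED for ch in name):
            problems.append(f"prohibited character in name: {name}")
    # 3. Columns that consist of undetermined characters only.
    n_sites = min((len(seq) for _, seq in records), default=0)
    for i in range(n_sites):
        if {seq[i].upper() for _, seq in records} <= DNA_UNDETERMINED:
            problems.append(f"undetermined column: {i + 1}")
    return problems


# Small made-up dataset that triggers every check once.
records = [("A1", "ACGTN"), ("A1", "ACGAN"), ("B 2", "ACGAN"), ("C3", "NNNNN")]
for problem in check_alignment(records):
    print(problem)
```

An empty list means that none of the five problems was detected; it does not guarantee that RaxML accepts the file.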
RaxML can run parallel calculations (threads), depending on the number of cores in your computer. How many threads you should use also depends on the size of the dataset, i.e. the length of the alignment.
As a rule of thumb, one core/thread should be used for every 500 DNA sites. An alignment of 1000 sites will efficiently run on 2 threads, more threads might even decrease efficiency.
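The rule of thumb can be expressed as a small helper. This is only a sketch of the 500-sites-per-thread rule stated above, capped at the number of available cores; the function name is hypothetical and RaxML does not choose the number of threads this way by itself.

```python
def suggested_threads(n_sites, n_cores, sites_per_thread=500):
    """One thread per 500 DNA sites, at least 1, at most n_cores."""
    by_sites = max(1, n_sites // sites_per_thread)
    return max(1, min(by_sites, n_cores))


print(suggested_threads(1000, 8))  # 1000 sites on an 8-core machine
print(suggested_threads(450, 8))   # fewer than 500 sites
```

Rounding down is a deliberate choice here, because using more threads than needed might decrease efficiency.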
Select 'Run RaxML' in PhyloMain
We will use the LSU33_mafft.afa alignment. If it hasn't been created yet, do so first. This alignment has fewer than 500 sites, so a single thread is sufficient and no parallel runs are needed.
Open Fasta file: D:\PhyloMain\Examples\LSU33_mafft.afa
Save tree As: (default value) D:\PhyloMain\Examples\LSU33_mafft.tre
If you want to view the tree after calculation, check 'View tree in FigTree'
Click 'Build tree'
RaxML will keep bootstrapping until it reaches convergence (until the bootstrap likelihood cannot be improved). With this dataset it needs around 500 bootstrap replicates.
After that it will search for the maximum likelihood tree.
Bootstrapping is a commonly used procedure to gain insight into the strength/support of a clade. A value of 100(%) means the clade is strongly supported. The discussion about the minimum bootstrap value needed to accept a clade as a true clade is still going on. Some say it should be at least 80%, others accept 70% or higher. Nevertheless, lower values don't always mean the clade is 'wrong'.
How does it work?
As an example we use this small alignment of 5 strains and 60 sites. The tree program randomly draws 60 sites, with replacement, to create a new alignment for the first bootstrap and builds a tree from it. Because the selection is random, some sites can be selected multiple times, while others are not selected at all. This is repeated as many times as the number of bootstraps set by the user, e.g. 1000. Mind you, this has nothing to do with the function of the gene, or part of the gene, that is sequenced.
You can imagine that if the sequences of some strains are exactly the same, the order of the nucleotides doesn't matter: these strains will always form the same clade in these 1000 trees, and the clade will have a support value of 100. If, however, the 'new' alignment causes some of the strains to be placed in a different clade, the value will drop. At the end a consensus tree is created with the bootstrap values for each node.
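The resampling step itself can be sketched in Python. This shows only how one bootstrap alignment is drawn (columns sampled with replacement), not the tree building or the consensus step; the function name and the short made-up alignment are assumptions for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    n_sites = len(next(iter(alignment.values())))
    # The same column indices are used for every strain, so columns stay intact.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


# Made-up alignment; strain1 and strain2 are identical.
alignment = {
    "strain1": "ACGTACGTACGT",
    "strain2": "ACGTACGTACGT",
    "strain3": "ACGAACTTACGA",
}
rng = random.Random(1)  # fixed seed, so the example is repeatable
for replicate in range(3):
    sample = bootstrap_alignment(alignment, rng)
    print(sample["strain1"], sample["strain2"], sample["strain3"])
```

Whatever columns are drawn, strain1 and strain2 remain identical in every replicate, which is why such a pair is grouped together in every bootstrap tree.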
To create a parsimony bootstrap tree we use PAUP. This is a stand-alone program and will be started from within PhyloMain, but in a separate window.
PAUP does not use fasta-formatted sequence files, but files in nexus format. The conversion program in PhyloMain can convert sequence files.
The construction of the tree can be done in 3 ways: entering command lines in the PAUP window, using the PAUP menus, or adding a PAUP block at the end of the nexus file.
First we will start by using the command line facility in PAUP.
Choose 'Convert sequence format' in PhyloMain
The default values for Input and Output are what we need.
Open the input file: D:\PhyloMain\Examples\LSU33_muscle.afa
Save the output file: D:\PhyloMain\Examples\LSU33.nex
Click 'Start conversion'
Unfortunately the conversion program adds a lot of blank lines at the end of the file; these lines can be deleted.
There are also format converters online, like ALTER.
Select 'Start PaupUp' in PhyloMain (a separate window with a condensed version of PAUP will open)
For convenience you can make the window a little bit larger.
We will log all commands and result output to a file:
File - Save log as -> lsu33_bootstrap.log
File - Open -> lsu33.nex
Make sure in 'Analysis' Parsimony is checked
At the command line, type the following command:
bootstrap nreps=200 treefile=boot.tre search=heuristic/ start=stepwise addseq=random nreps=10 swap=TBR <enter>
Because the limit of trees is set to 100, the program will stop execution to obtain input from the user.
We are going to increase 'MaxTrees' by typing Y in the command line and enter a new value, 900.
Now we are offered three options; the best option would be (2) but will increase calculation time. For the sake of the course we will choose 3 (Leave unchanged, and don't prompt)
After the calculation has finished type the following to save the tree, otherwise it will be deleted from memory:
savetrees file=bootMajRule.tre from=1 to=1 savebootp=nodelabels <enter>
Some options in this line are required: from/to tells PAUP which trees to save, even if there is only one tree in memory. To be able to see the bootstrap proportions in e.g. FigTree we have to add 'savebootp=nodelabels'.
Get the previous boot.tre into memory: gettrees file=boot.tre StoreTreeWts=yes mode=3 <enter>
and create a consensus tree: contree all/strict=no majrule=yes usetreewts=yes treefile=bootMajRuleCon.tre <enter>
Note: all tree files are saved in the directory where PhyloMain started, in our case D:\PhyloMain.
Not everything in PAUP can be covered during this course, but if you are interested, check out this website.
The previous commands can also be written as a PAUP block at the end of the lsu33_muscle.nex file.
When the new lsu33_muscle.nex file is opened in PAUP, it will immediately execute these lines.
begin paup;
set maxtrees=900 increase=no;
bootstrap nreps=200 treefile=boot.tre search=heuristic/ start=stepwise addseq=random nreps=10 swap=TBR;
savetrees file=bootMajRule.tre from=1 to=1 savebootp=nodelabels;
gettrees file=boot.tre StoreTreeWts=yes mode=3;
contree all/strict=no majrule=yes usetreewts=yes treefile=bootMajRuleCon.tre;
end;
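If you need to add such a block to several nexus files, a small script can do it. The sketch below is an assumption of how this could be done in Python and is not a PhyloMain function; it only wraps the commands shown above in 'begin paup;' and 'end;' and removes trailing blank lines first.

```python
# The PAUP commands from the block above, without the closing semicolon.
PAUP_COMMANDS = [
    "set maxtrees=900 increase=no",
    "bootstrap nreps=200 treefile=boot.tre search=heuristic/ "
    "start=stepwise addseq=random nreps=10 swap=TBR",
    "savetrees file=bootMajRule.tre from=1 to=1 savebootp=nodelabels",
    "gettrees file=boot.tre StoreTreeWts=yes mode=3",
    "contree all/strict=no majrule=yes usetreewts=yes "
    "treefile=bootMajRuleCon.tre",
]


def append_paup_block(nexus_text, commands):
    """Return the nexus text with a PAUP block added at the end."""
    block = ["begin paup;"] + [f"{command};" for command in commands] + ["end;"]
    # rstrip() also removes blank lines left at the end by a converter.
    return nexus_text.rstrip() + "\n\n" + "\n".join(block) + "\n"


# Shortened, made-up nexus content, only to show the result.
print(append_paup_block("#NEXUS\nbegin data;\nend;\n\n\n", PAUP_COMMANDS))
```

To use it on a real file, read the file into a string, call the function and write the result to a new file, so the original stays untouched.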
The file 'carrionii.nex' also contains a PAUP execution block and a SETS block. Two sequence markers are concatenated, ITS and beta-tubulin, and these are defined in the SETS block. The 'hompart' command then starts a partition homogeneity test, which tries to find incongruence between the two markers; if there is any, the two markers should not be combined to make a phylogenetic tree, because they might have a different evolutionary rate.
begin sets;
charset ITS = 1-549;
charset BT2 = 550-938;
charpartition 2genes=ITS:1-549, BT2:550-938;
end;
begin paup;
set maxtrees=100 increase=no;
log file=carrionii.log;
hompart partition=2genes nreps=100 / start=stepwise addseq=random nreps=10 savereps=no randomize=addseq rstatus=no hold=1 swap=tbr multrees=yes;
log stop;
end;
Such a file can be created in PhyloMain by selecting 'Create partition file'.
Following the 'carrionii.nex' example:
Number of genes concatenated: 2
Marker name -> ITS from 1 to 549 -> Add Item
Marker name -> BT2 from 550 to 938 -> Add Item
Check 'Add charpartition' (we are going to use it for PAUP) and name it '2genes'
Because the sets block is already present in carrionii.nex, we are not going to add it to that file; we only save the partition file.
Click 'Save partition file' and name it 'carrionii_part.nex'
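What ends up in such a file can be illustrated with a short Python sketch that builds the same SETS block as in carrionii.nex from the marker names and positions. The function name is hypothetical, and the exact layout written by PhyloMain may differ.

```python
def sets_block(markers, partition_name=None):
    """Build a nexus SETS block from (name, start, end) tuples (1-based)."""
    lines = ["begin sets;"]
    for name, start, end in markers:
        lines.append(f"charset {name} = {start}-{end};")
    if partition_name:
        # The charpartition combines all markers under one name for PAUP.
        parts = ", ".join(f"{name}:{start}-{end}"
                          for name, start, end in markers)
        lines.append(f"charpartition {partition_name}={parts};")
    lines.append("end;")
    return "\n".join(lines) + "\n"


print(sets_block([("ITS", 1, 549), ("BT2", 550, 938)], "2genes"))
```

Leaving out the partition name gives a block with the charset lines only.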
SequenceMatrix is a standalone program and will open in a separate window. It allows us to concatenate different markers into one sequence. It is programmed in Java, and therefore a Java runtime environment should be installed on your computer.
Leading and trailing gaps will automatically be converted to ?
It is important that the ID of a strain ('>' line) is exactly the same in all sequence files; otherwise the program can't concatenate the sequences and will create a new entry.
>CBS 14054 is not the same as >CBS14054 or >CBS_14054 or >CBS 14054 Candida albicans.
During concatenation the program will inform us about the presence of a strain in the different sequence files. If a strain is missing from a file, it will print (No data).
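The effect of non-matching IDs is easy to demonstrate with a small concatenation sketch in Python. This is not the code of SequenceMatrix: the function names are hypothetical, and filling a missing marker with '?' over its full length is an assumption made for this example.

```python
def mask_terminal_gaps(seq):
    """Replace leading and trailing '-' by '?', keep internal gaps."""
    core = seq.strip("-")
    if not core:
        return "?" * len(seq)
    lead = len(seq) - len(seq.lstrip("-"))
    trail = len(seq) - len(seq.rstrip("-"))
    return "?" * lead + core + "?" * trail


def concatenate(markers):
    """markers: {marker name: {strain ID: aligned sequence}} (non-empty)."""
    ids = sorted({sid for seqs in markers.values() for sid in seqs})
    result = {}
    for sid in ids:
        parts = []
        for seqs in markers.values():
            length = len(next(iter(seqs.values())))
            if sid in seqs:
                parts.append(mask_terminal_gaps(seqs[sid]))
            else:
                parts.append("?" * length)  # no data for this marker
        result[sid] = "".join(parts)
    return result


# Made-up data: the ID is written differently in the two marker files.
markers = {
    "ITS": {"CBS 14054": "--ACGT-A--", "CBS14054": "ACACGTTAGG"},
    "BT2": {"CBS 14054": "TTGA"},
}
for strain_id, sequence in concatenate(markers).items():
    print(strain_id, sequence)
```

Because 'CBS 14054' and 'CBS14054' are different strings, the result contains two entries instead of one, and the second has no data for BT2.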
In the Examples directory there are no datasets for different genes that can be concatenated. Extract 'sequencematrix.zip' in the Examples directory or download the following zip-file and extract it into the Examples directory. Three .afa files will be added to the directory: BT2.afa, EF1.afa and ITS.afa.
Select 'SequenceMatrix' in PhyloMain
Select BT2.afa, EF1.afa and ITS.afa by holding down the Ctrl-key.
Drag the selected files to the SequenceMatrix window.
SequenceMatrix exports the concatenated file in nexus format.
There are 3 options:
1. Export sequences as Nexus (interleaved, 1000bp)
2. Export sequences as Nexus (non-interleaved)
3. Export sequences as Nexus ("naked", e.g. for GARLI)
Option 3 will export the sequences without extra information. Options 1 and 2 will add information of the datasets that were concatenated.
This is very useful, if we want to create a partition file with begin and end values of the different genes, so we don't have to calculate these ourselves.
Export the sequences with option 1, open the created file in a text editor like Notepad and scroll to the end of the file.
There you will see the definition of the sets.
CHARSET BT2 = 1-392;
CHARSET EF1 = 393-586;
CHARSET ITS = 587-1188;
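These CHARSET lines can also be read by a script, for instance to check the length of each marker. A minimal sketch in Python, assuming every charset is written on its own line as 'CHARSET name = start-end;'; the function name is hypothetical.

```python
import re

# Matches lines such as "CHARSET BT2 = 1-392;" (case-insensitive).
CHARSET_RE = re.compile(
    r"^\s*charset\s+(\S+)\s*=\s*(\d+)\s*-\s*(\d+)\s*;",
    re.IGNORECASE | re.MULTILINE,
)


def read_charsets(nexus_text):
    """Return a list of (name, start, end) for every CHARSET line."""
    return [(name, int(start), int(end))
            for name, start, end in CHARSET_RE.findall(nexus_text)]


text = "CHARSET BT2 = 1-392;\nCHARSET EF1 = 393-586;\nCHARSET ITS = 587-1188;\n"
for name, start, end in read_charsets(text):
    print(name, start, end, "length:", end - start + 1)
```

For the three markers above this gives lengths of 392, 194 and 602 sites, together 1188.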
MatGAT is a standalone program and will open in a separate window. It compares all sequences with each other and calculates the similarity and identity values. The table with values can be exported to Excel. Unfortunately, the colouring of the cells as shown in MatGAT is not exported. With a set of macros in temp0.xlsm it is possible to add colours and some other formatting.
Select 'Similarity/Identity' in PhyloMain
How are similarity and identity calculated?
Suppose we have these 2 sequences aligned:
A: AAGGCTTCAGCTA
B: AAGGC--CTG-TA
Identity: nr. of identical nucleotides / shortest sequence length = 9/10 = 0.9 (90%)
Similarity: 1 - (nr. of differences / shortest sequence length) = 1 - (4/10) = 0.6 (60%)
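The two formulas can be checked with a few lines of Python. This sketch follows the formulas exactly as written above, counting a gap against a nucleotide as a difference; it is not MatGAT's own code and the function name is hypothetical.

```python
def identity_and_similarity(a, b, gap="-"):
    """Apply the two formulas above to a pair of aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Shortest sequence length, counted without gap characters.
    shortest = min(len(a.replace(gap, "")), len(b.replace(gap, "")))
    identical = sum(1 for x, y in zip(a, b) if x == y and x != gap)
    # Differences: mismatches plus positions with a gap in one sequence.
    differences = sum(1 for x, y in zip(a, b) if x != y)
    identity = identical / shortest
    similarity = 1 - differences / shortest
    return identity, similarity


identity, similarity = identity_and_similarity("AAGGCTTCAGCTA", "AAGGC--CTG-TA")
print(round(identity, 2), round(similarity, 2))
```

For sequences A and B there are 9 identical positions, 1 mismatch and 3 gap positions, and the shortest sequence has 10 nucleotides.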