Take a look at the three main sections of TriFusion and discover what it can do for you. Scroll away!
TriFusion is a modern GUI and command line application designed to make the life of anyone with proteome and/or alignment sequence data easier and more pleasurable. Regardless of your experience in bioinformatics, TriFusion is easy to use and offers a wide array of powerful features to help you deal with your data. At the same time, it was developed to handle the enormous amount of data that is generated nowadays.
Fast and efficient
User friendly
CLI version available
Open Source
Searching for orthologs across multiple proteomes can be a daunting and
complex task. Current pipelines are great, but they often require
a long and tedious setup and/or execution process. TriFusion handles all of that for you, while providing
a user-friendly interface that leverages the most popular framework for finding orthologs,
OrthoMCL.
No fuss. No complications. Searching for orthologs in TriFusion is as simple as it gets.
So, you have completed your search for orthologs. Now what?
The Explore section of TriFusion's orthology module was designed to let you easily explore your
ortholog search results. You can filter ortholog groups according to gene copy number
and/or taxon representation. You can also check the effect that these filters have on
the final number of ortholog groups using a suite of interactive plot screens.
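As a rough illustration of the two filters described above, here is a minimal sketch. The function name `filter_groups`, the `(taxon, sequence_id)` data layout and the parameter names are assumptions made for this example, not TriFusion's own API.

```python
from collections import Counter


def filter_groups(groups, max_copies=1, min_taxa=2):
    """Keep ortholog groups that pass both filters.

    groups maps a group name to a list of (taxon, sequence_id) pairs
    (a hypothetical layout chosen for this sketch).
    """
    kept = {}
    for name, members in groups.items():
        # Number of gene copies contributed by each taxon in this group
        copies = Counter(taxon for taxon, _ in members)
        if copies and max(copies.values()) <= max_copies and len(copies) >= min_taxa:
            kept[name] = members
    return kept
```

Counting how many groups survive for several threshold values gives the kind of numbers that the interactive plots summarize.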
The final step of an ortholog search operation is usually the generation of sequence files
for each potential ortholog group. TriFusion lets you apply multiple filters on your ortholog groups
and export only the groups that you are interested in.
But that's not all. If you provide the CDS files corresponding to each proteome that you used in your
search, you can also export those ortholog groups directly into nucleotide sequences. And don't worry
if the headers of the sequences in the CDS and proteome files don't match. TriFusion searches for
sequence equality, not header equality, so you can be sure that the sequences are always correctly
converted.
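To make the idea of matching by sequence rather than by header concrete, here is a minimal sketch. It assumes the standard genetic code and that each protein is the direct translation of its CDS; the names `translate` and `match_cds_to_proteins` are hypothetical and do not come from TriFusion.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a CDS into protein, dropping trailing stop codons."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore an incomplete final codon
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3)
    )
    return protein.rstrip("*")


def match_cds_to_proteins(proteins, cds_records):
    """Map each protein header to the CDS whose translation equals it.

    Headers are never compared; a protein without a match maps to None.
    """
    by_translation = {translate(seq): seq for seq in cds_records.values()}
    return {
        header: by_translation.get(seq.rstrip("*"))
        for header, seq in proteins.items()
    }
```

For example, a protein `MK` stored under the header `prot1` would be paired with the CDS `ATGAAATAA` even if that CDS is stored under an unrelated header such as `geneX`.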
Designed from the ground up to handle the massive number of alignments that are currently being
processed in phylogenomic analyses, TriFusion is able to convert and concatenate thousands of alignment
files from several input formats into one or more popular output formats (check the supported formats).
Easily load your data and get the desired converted/concatenated output in seconds.
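Conceptually, concatenation joins the alignments side by side and pads any taxon that is absent from a file with missing data. The sketch below assumes alignments held in memory as dictionaries and `N` as the missing data symbol; `concatenate` is a hypothetical helper, not how TriFusion handles thousands of files internally.

```python
def concatenate(alignments, missing="N"):
    """Concatenate a list of alignments (dicts of taxon -> sequence).

    Taxa absent from an alignment are padded with the missing data symbol.
    Each alignment is assumed to be non-empty and of uniform length.
    """
    taxa = sorted(set().union(*alignments))
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        length = len(next(iter(aln.values())))
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, missing * length))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}
```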
So you have a 1Gb alignment file? Or perhaps a set of 7,000 individual alignment files. Don't panic, TriFusion can handle it! With the inclusion of high performance libraries and several efficient programming techniques, even the largest data sets can be processed in seconds or a few minutes. Here's a sample of the run time and RAM consumption benchmarks for the main operations of TriFusion for different types of data sets.
Data set | DS1 | DS4 | DS5 | DS9 |
---|---|---|---|---|
Files | 614 | 3093 | 7378 | 1 |
Taxa | 48 | 376 | 29 | 376 |
Total bases | ~350k | 1.25M | ~601k | 1.25M |
Parsing | | | | |
Alignment reading | 0.8s / 13.9Mb | 14.5s / 169.5Mb | 13.6s / 117.6Mb | 2.9s / 14.5Mb |
Main operations | | | | |
Conversion (nexus) | 0.5s / 0.3Mb | 17.0s / 0.6Mb | 93.7s / 1.0Mb | 2.4s / 22.8Mb |
Conversion (interleave) | 2.4s / 0.4Mb | 49.9s / 0.6Mb | 110.1s / 1.0Mb | 74.9s / 17.0Mb |
Concatenation | 4.7s / 5.1Mb | 554.4s / 21.6Mb | 211.2s / 7.1Mb | NA |
You can check the complete benchmarks table to see in detail how particular operations of TriFusion fare in terms of run time and memory usage across a wide range of datasets.
With the Collapse operation, you can collapse identical sequences into the same haplotype,
creating an alignment that contains only unique sequences.
Here's a demonstration of how Collapse transforms an alignment:
Example:
# Four taxon alignment where there are only two unique sequences
>TaxonA
AAAAAA
>TaxonB
AAAAAA
>TaxonC
TTTTTT
>TaxonD
TTTTTT

# Collapsed alignment with only the two unique sequences
>Hap1
AAAAAA
>Hap2
TTTTTT
A .haplotypes file is generated alongside the collapsed alignment with the correspondence between haplotype names and the original taxon names.
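The Collapse logic can be sketched in a few lines. This is an illustration only: the function name `collapse` is hypothetical, and the `Hap1`, `Hap2` naming simply follows the example above.

```python
def collapse(alignment):
    """Collapse identical sequences into haplotypes.

    Returns (collapsed, correspondence): collapsed maps haplotype names to
    unique sequences, and correspondence maps haplotype names to the list of
    original taxa, which is the information a .haplotypes file would hold.
    """
    seq_to_taxa = {}
    for taxon, seq in alignment.items():
        seq_to_taxa.setdefault(seq, []).append(taxon)

    collapsed, correspondence = {}, {}
    for index, (seq, taxa) in enumerate(seq_to_taxa.items(), start=1):
        name = f"Hap{index}"
        collapsed[name] = seq
        correspondence[name] = taxa
    return collapsed, correspondence
```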
With the Consensus operation, you can create a consensus sequence for each alignment individually or for the concatenated alignment. There are several methods available on how to deal with sequence variation.
Options for sequence variation handling during consensus:
IUPAC | Soft mask | Remove | First sequence |
---|---|---|---|
(Nucleotide sequences only) Converts variable sites using the IUPAC nucleotide ambiguity code | Replaces variable sites with the missing data symbol | Removes variable sites from the consensus | An arbitrary method that simply uses the first sequence without modification |
Optionally, consensus sequences from multiple alignments can be merged into a single output file, ready for downstream analyses. Here's an example of how consensus sequences are created for three input alignments.
Example:
# Alignment1.fas
>TaxonA
ATAAAA
>TaxonB
AAAAAA
>TaxonC
AAAAAG
>TaxonD
AAAAAA

# Alignment2.fas
>TaxonA
TATTTT
>TaxonB
TTTTTC
>TaxonC
TTTTTT
>TaxonD
TTTTTT

# Alignment3.fas
>TaxonA
GAGGGG
>TaxonB
GGGGGG
>TaxonC
GGGGGG
>TaxonD
GGGGGG

Create individual consensus masking sequence variation

# Alignment1_consensus.fas
>consensus
ANAAAN

# Alignment2_consensus.fas
>consensus
TNTTTN

# Alignment3_consensus.fas
>consensus
GNGGGG

Optionally, merge all consensus into a single file

# consensus.fas
>Alignment1_consensus
ANAAAN
>Alignment2_consensus
TNTTTN
>Alignment3_consensus
GNGGGG
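The four ways of handling variable sites can be sketched as follows. This is a simplified illustration that assumes nucleotide sequences without gaps or missing data; the function name `consensus` and the method labels are assumptions for this example.

```python
# IUPAC nucleotide ambiguity codes, keyed by the set of observed bases
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def consensus(sequences, method="soft mask", missing="N"):
    """Build a consensus from a list of aligned sequences."""
    if method == "first sequence":
        return sequences[0]
    if method not in ("iupac", "soft mask", "remove"):
        raise ValueError(f"unknown method: {method}")

    result = []
    for column in zip(*sequences):
        states = frozenset(column)
        if len(states) == 1:
            result.append(column[0])       # invariable site, keep as is
        elif method == "iupac":
            result.append(IUPAC.get(states, missing))
        elif method == "soft mask":
            result.append(missing)         # mask the variable site
        # method == "remove": the variable site is simply skipped
    return "".join(result)
```

Applied to Alignment1 above, the soft mask method gives `ANAAAN`, the IUPAC method gives `AWAAAR`, and the remove method gives `AAAA`.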
Alignments can be filtered depending on whether they contain or exclude user-defined taxa groups. These taxa groups can be easily defined in TriFusion or imported from text files and can represent any number of samples in the data set. For instance, in the example below, we filter out alignments that do not contain both TaxonA and TaxonD, resulting in only 2 out of 4 alignments being further processed.
Example:
# Alignment1.fas
>TaxonA
AAAAAA
>TaxonB
TTTTTT
>TaxonC
CCCCCC
>TaxonD
DDDDDD

# Alignment2.fas
>TaxonA
AAAAAA
>TaxonB
TTTTTT

# Alignment3.fas
>TaxonB
TTTTTT
>TaxonC
CCCCCC
>TaxonD
DDDDDD

# Alignment4.fas
>TaxonA
AAAAAA
>TaxonD
DDDDDD

Preserving only alignments that contain TaxonA and TaxonD

# Alignment1.fas
>TaxonA
AAAAAA
>TaxonB
TTTTTT
>TaxonC
CCCCCC
>TaxonD
DDDDDD

# Alignment4.fas
>TaxonA
AAAAAA
>TaxonD
DDDDDD
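A minimal sketch of this filter is shown below. The function name `filter_by_taxa` and the `mode` parameter are hypothetical, and the reading of the exclude mode (keep alignments with none of the group's taxa) is an assumption.

```python
def filter_by_taxa(alignments, taxa_group, mode="contain"):
    """Filter alignments (name -> dict of taxon -> sequence) by a taxa group.

    mode="contain": keep alignments that contain every taxon in the group.
    mode="exclude": keep alignments that contain none of them (assumed).
    """
    group = set(taxa_group)
    kept = {}
    for name, aln in alignments.items():
        if mode == "contain" and group.issubset(aln):
            kept[name] = aln
        elif mode == "exclude" and group.isdisjoint(aln):
            kept[name] = aln
    return kept
```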
Alignment columns can be filtered according to their codon positions in any combination. This is quite useful when you wish to remove, for example, saturated codon positions for downstream analyses. However, note that, for now, TriFusion will filter the same codon positions for each input alignment.
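In code, keeping a combination of codon positions amounts to selecting columns by their position modulo 3. The sketch below assumes the alignment starts at the first codon position; `filter_codon_positions` is a hypothetical name.

```python
def filter_codon_positions(sequence, positions=(True, True, False)):
    """Keep only the columns whose codon position is switched on.

    positions holds one flag for each of the 1st, 2nd and 3rd codon
    positions; the sequence is assumed to start in frame.
    """
    return "".join(
        base for index, base in enumerate(sequence) if positions[index % 3]
    )
```

For example, `filter_codon_positions("ACGACGACG")` drops every third position and returns `ACACAC`.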
Filtering alignments and alignment columns according to missing taxa and missing data, respectively, is one of the most common processing steps when dealing with large amounts of data.
Filter missing data within alignments
1. Replace gaps at the extremities
The first step when filtering alignment columns for missing data is to replace the gap symbols at the extremities of the alignment by missing data symbols. Gaps at the extremities of the alignment are usually introduced by alignment programs but they actually represent missing data. The distinction may be important for several downstream analyses, for filtering according to gaps/missing data or for coding gaps at the end of the alignment matrix.
# Original alignment
>TaxonA
--AA-AA--A-
>TaxonB
-AAAA--AAAA
>TaxonC
AAA--AAAA--
>TaxonD
AAA--AA-AAA

# Gaps at the extremities replaced with missing data
>TaxonA
NNAA-AA--AN
>TaxonB
NAAAA--AAAA
>TaxonC
AAA--AAAANN
>TaxonD
AAA--AA-AAA
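This first step can be sketched as a small string operation. The function name `replace_terminal_gaps` is hypothetical, and `N` is assumed as the missing data symbol, as in the nucleotide example above.

```python
def replace_terminal_gaps(sequence, gap="-", missing="N"):
    """Replace leading and trailing gaps with the missing data symbol.

    Internal gaps are left untouched.
    """
    core = sequence.strip(gap)
    if not core:
        # The sequence consists only of gaps
        return missing * len(sequence)
    left = len(sequence) - len(sequence.lstrip(gap))
    right = len(sequence) - len(sequence.rstrip(gap))
    return missing * left + core + missing * right
```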
2. Filter columns with excessive gaps/missing data
You may want to filter your alignments in order to reduce the amount of columns with excessive missing data, or just plainly remove all missing data from the alignments. TriFusion allows you to set two independent thresholds for gaps and missing data and then removes all alignment columns that exceed those thresholds.
# Filter columns with max gap 50%
# and max missing data 50%
>TaxonA
NNAA-AA--AN
>TaxonB
NAAAA--AAAA
>TaxonC
AAA--AAAANN
>TaxonD
AAA--AA-AAA

# Two columns filtered
>TaxonA
NAAAA--AN
>TaxonB
AAA--AAAA
>TaxonC
AA-AAAANN
>TaxonD
AA-AA-AAA
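Here is a generic sketch of threshold-based column filtering with two independent thresholds. It makes its own assumptions: proportions are computed per column over all taxa, and a column is removed only when a proportion is strictly greater than its threshold. TriFusion's exact rules may differ, and `filter_columns` is a hypothetical name.

```python
def filter_columns(alignment, max_gap=0.5, max_missing=0.5,
                   gap="-", missing="N"):
    """Remove columns whose gap or missing data proportion exceeds a threshold.

    alignment maps taxon names to aligned sequences of equal length.
    """
    taxa = list(alignment)
    n_taxa = len(taxa)
    kept_columns = []
    for column in zip(*(alignment[taxon] for taxon in taxa)):
        gap_prop = column.count(gap) / n_taxa
        missing_prop = column.count(missing) / n_taxa
        if gap_prop <= max_gap and missing_prop <= max_missing:
            kept_columns.append(column)

    # Transpose the kept columns back into one row per taxon
    rows = zip(*kept_columns) if kept_columns else [()] * n_taxa
    return {taxon: "".join(row) for taxon, row in zip(taxa, rows)}
```

Setting both thresholds to 0 removes every column that contains any gap or missing data.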
Filter alignments with low taxa coverage
When you have multiple files you may want to filter alignments that have low taxa coverage in order to produce denser alignment matrices. This can be easily accomplished by setting a minimum taxa representation proportion that alignments will have to pass to be further processed. This can be used in combination with the taxa filter option for finer-grained control over the resulting alignment matrices.
# 6 taxa file
>TaxonA
AAAAAA
>TaxonB
AAAAAA
>TaxonC
AAAAAA
>TaxonD
AAAAAA
>TaxonE
AAAAAA

# 3 taxa file
>TaxonA
AAAAAA
>TaxonD
AAAAAA
>TaxonE
AAAAAA

# 2 taxa file
>TaxonA
AAAAAA
>TaxonB
AAAAAA

# 5 taxa file
>TaxonA
AAAAAA
>TaxonB
AAAAAA
>TaxonC
AAAAAA
>TaxonD
AAAAAA
>TaxonE
AAAAAA

Minimum taxa representation of 50%

# 6 taxa file
>TaxonA
AAAAAA
>TaxonB
AAAAAA
>TaxonC
AAAAAA
>TaxonD
AAAAAA
>TaxonE
AAAAAA

# 5 taxa file
>TaxonA
AAAAAA
>TaxonB
AAAAAA
>TaxonC
AAAAAA
>TaxonD
AAAAAA
>TaxonE
AAAAAA
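A minimal sketch of a taxa representation filter follows. It assumes that the proportion is measured against all taxa found across the loaded files and that an alignment exactly at the minimum is kept; both choices, like the name `filter_by_taxa_representation`, are assumptions rather than TriFusion's documented behaviour.

```python
def filter_by_taxa_representation(alignments, min_proportion):
    """Keep alignments whose share of the data set's taxa is high enough.

    alignments maps file names to dicts of taxon -> sequence.
    """
    all_taxa = set().union(*alignments.values())
    total = len(all_taxa)
    return {
        name: aln
        for name, aln in alignments.items()
        if total and len(aln) / total >= min_proportion
    }
```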
Alignments can be filtered according to the number of variable and/or parsimoniously informative columns. You can specify a range of acceptable values for either metric and keep only the alignments with the type of variation that you are interested in.
Example:
# 4 variable sites
# Alignment1.fas
>TaxonA
AAAAAA
>TaxonB
ATAATA
>TaxonC
TAAAAA
>TaxonD
AAAAAT

# 1 variable site
# Alignment2.fas
>TaxonA
AAAAAA
>TaxonB
AAAATA
>TaxonC
AAAAAA
>TaxonD
AAAAAA

# 0 variable sites
# Alignment3.fas
>TaxonA
AAAAAA
>TaxonB
AAAAAA
>TaxonC
AAAAAA
>TaxonD
AAAAAA

# 3 variable sites
# Alignment4.fas
>TaxonA
ATAATA
>TaxonB
AAAAAA
>TaxonC
AAAAAA
>TaxonD
AAAAAT

Filter out alignments with fewer than 4 variable sites

# 4 variable sites
>TaxonA
AAAAAA
>TaxonB
ATAATA
>TaxonC
TAAAAA
>TaxonD
AAAAAT
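Both metrics can be computed column by column. The sketch below uses the usual definitions: a column is variable if it holds more than one character state, and parsimony informative if at least two states each occur in at least two sequences. Ignoring gaps and missing data symbols is an assumption, as is the name `site_counts`.

```python
from collections import Counter


def site_counts(sequences, ignore="-N?"):
    """Return (variable, informative) column counts for aligned sequences."""
    variable = informative = 0
    for column in zip(*sequences):
        counts = Counter(char for char in column if char not in ignore)
        if len(counts) > 1:
            variable += 1
            # Informative: at least two states seen in two or more sequences
            if sum(1 for n in counts.values() if n >= 2) >= 2:
                informative += 1
    return variable, informative
```

In the example above, Alignment1 has 4 variable sites but no parsimony informative ones, since every change is found in a single taxon.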
If you suspect that the indel patterns in your alignments may contain valuable phylogenetic information, TriFusion can code your indel patterns as binary states at the end of the alignment matrix and create a well formatted Nexus file ready for downstream analysis with programs such as MrBayes. If you are concatenating multiple alignments, the binary matrix will be concatenated as well.
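To give an idea of what binary indel coding looks like, here is one simple scheme: every distinct gap run, identified by its start and end columns, becomes one binary character. This is an illustrative simplification and not necessarily the scheme TriFusion implements; `code_indels` is a hypothetical name.

```python
import re


def code_indels(alignment, gap="-"):
    """Code each distinct gap run as a presence/absence character.

    Returns (events, matrix): events is the sorted list of (start, end)
    column spans, and matrix maps each taxon to its string of 0s and 1s.
    """
    pattern = re.compile(re.escape(gap) + "+")
    spans = {
        taxon: {match.span() for match in pattern.finditer(seq)}
        for taxon, seq in alignment.items()
    }
    events = sorted(set().union(*spans.values()))
    matrix = {
        taxon: "".join("1" if event in taxon_spans else "0" for event in events)
        for taxon, taxon_spans in spans.items()
    }
    return events, matrix
```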
If you have a concatenated alignment and a partitions file, you can easily export those partitions as individual alignment files. More generally, TriFusion allows you to export subsets of any given alignment that are defined in a provided partitions file OR defined within TriFusion - even codon partitions!
Example:
If you have a 1,000bp alignment and wish to export 4 partitions of equal length as individual alignment files, you just need to write a partitions file like this (Alternatively you could also define those partitions inside TriFusion - see this tutorial):
# Export these 4 partitions
charset alignment1 = 1:250;
charset alignment2 = 251:500;
charset alignment3 = 501:750;
charset alignment4 = 751:1000;
The export of codon partitions is also possible using the /3 notation. For instance,
to split the first partition in the example above into three codon partitions you could do
the following:
# Export these 6 partitions
charset alignment1_1 = 1:250/3;
charset alignment1_2 = 2:250/3;
charset alignment1_3 = 3:250/3;
charset alignment2 = 251:500;
charset alignment3 = 501:750;
charset alignment4 = 751:1000;
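The charset notation shown above maps naturally onto column ranges. The sketch below parses only the two forms used in these examples (`start:end` and `start:end/3`); the regular expression and the helper names `parse_partitions` and `export_partition` are assumptions made for illustration, not TriFusion's parser.

```python
import re

# Matches e.g. "charset alignment1_2 = 2:250/3;"
CHARSET = re.compile(r"charset\s+(\w+)\s*=\s*(\d+):(\d+)(?:/(\d+))?\s*;")


def parse_partitions(text):
    """Return a dict of partition name -> range of 0-based column indices."""
    partitions = {}
    for name, start, end, step in CHARSET.findall(text):
        # Positions in the file are 1-based and inclusive
        partitions[name] = range(int(start) - 1, int(end), int(step or 1))
    return partitions


def export_partition(sequence, columns):
    """Extract the columns of one partition from an aligned sequence."""
    return "".join(sequence[index] for index in columns)
```

With this, `1:250/3` selects columns 1, 4, 7 and so on up to 250, which are the first codon positions of the first partition.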
All operations of TriFusion's Process can be chained in any way during a single run. For instance,
you may want to concatenate your alignments, filter alignments that exclude a given taxa group, remove
alignment columns with high missing data and collapse sequences into unique haplotypes. All in the same
execution.
In addition, the output of any individual operation can be saved to a new file that is independent from
the main output file. For instance, you can select the concatenation and collapse of your alignments, but
save the output of the collapse operation to a second file.
For performance reasons all these operations follow a specific order:
The video below demonstrates a single run where 141 alignments are concatenated, collapsed, filtered by taxa group, filtered by minimum taxa representation, filtered by missing data and filtered by sequence variation, and the consensus sequence of the concatenated alignment is also generated in an additional output file.
Sometimes it may be desired to perform different tasks on different sets
of taxa or alignments.
For instance, you may have a data set of nuclear and mitochondrial
alignments and wish to create a concatenated alignment for each set of alignments.
Or perhaps your data set includes several taxonomic groups of interest and you want
to create multiple concatenated data matrices that maximize the data for those particular
groups of taxa.
To make your life easier when these needs arise, TriFusion allows you to create or import
groups of taxa and/or alignments and then rapidly switch between them. The video below
demonstrates how groups of files and taxa can be quickly created and then how you
can perform different tasks on them.
If you have one or more data sets that you need to process or analyze frequently, TriFusion offers the possibility of saving any set of files currently loaded into the application as a project, so that they can be quickly loaded in future sessions. This can be useful in the same session as well, as a way of quickly switching between different sets of files.
Partitions can be easily defined and changed within TriFusion, imported from partitions files, or specified within the alignment file in the Nexus format. You can also set and modify the substitution model for each partition, which can be useful for several downstream analyses. In the case of nucleotide sequences, codon partitions can also be defined in any possible combination and with independent substitution models.
As soon as you load your alignment data and enter the Statistics screen, calculation of summary statistics will start in the background. When completed, you will be able to see several statistics related to general, missing data and sequence variation information averaged across the data set. In addition, you can also see these statistics detailed for each input alignment file in a tabular format with sorting and search functions. Both of these formats can be exported into a .csv file that can be opened in any spreadsheet software.
One of the main challenges of dealing with very large sets of alignment files
is the difficulty in getting a good feel for the data, that is, understanding its characteristics
and peculiarities.
How much missing data do I have? And sequence variation? How are these metrics distributed
across taxa and alignments? Are there any outlier taxa/alignments?
Answering these questions is the main purpose of the Statistics module. There are dozens
of plotting options available for you to quickly explore and visualize many aspects of your data.
You can check the plot gallery to see some examples.
For each of these plotting options, you may have up to three plot types that provide a different focus on the same analysis:
Some plotting analyses can be performed on individual alignments and
generate a sliding window plot with the variation of a given metric along the length of the
alignment. For the moment, these plot types are only provided for some Polymorphism and
sequence variation plot options.
Here's an example of the sliding window analyses for individual alignments.
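For readers curious about what a sliding window analysis computes, here is a minimal sketch that counts variable sites per window along one alignment. The window and step sizes, the choice of metric and the name `sliding_window_variation` are assumptions for illustration.

```python
def sliding_window_variation(sequences, window=100, step=50):
    """Count variable columns in successive windows along an alignment.

    Returns a list of (window_start, variable_sites) tuples, where
    window_start is a 0-based column index.
    """
    variable = [len(set(column)) > 1 for column in zip(*sequences)]
    last_start = max(len(variable) - window, 0)
    return [
        (start, sum(variable[start:start + window]))
        for start in range(0, last_start + 1, step)
    ]
```

Plotting the second value of each tuple against the first gives the variation of the metric along the length of the alignment.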
If you are interested in evaluating the impact that some metric has on
individual taxa, use the Taxa plot type, when available. This plot type focuses
the analysis on taxa by discriminating them along the x-axis, in
a triangular matrix or along a distribution.
Here's an example with the generation of some taxa focused plots.
If you are interested in data set wide trends or distributions,
use the Average plot type, when available. These plot types will usually generate
the averaged distribution of the metric of interest across the active data set.
Here's an example with the generation of some data set wide distributions.
While exploring your data, it is common to switch back and forth among several plots. Since some plots may take a while to be generated, depending on the size of the active data set, several techniques were implemented to allow fast switching among plots that have already been generated. In addition, some computations are stored locally in a SQLite database in order to reduce waiting times when changing the active data set.
Just like in the Process section, you can create or import groups of taxa/files and switch between them to generate the plots. Simply create your groups of interest and then select the ones you want for any plotting options.
All plots generated in the Statistics module can be exported as a figure (in several graphical formats, including vector formats). Generally, the plot generators were designed to produce high quality and publication ready graphics. However, you may want to tweak, change or add content to these plots. To make life easier for you, you can also export the data used to generate the plot as a .csv file, which can be easily loaded by other graphical libraries and software.