TriFusion features

Take a look at the three main sections of TriFusion and discover what it can do for you.

Overview

TriFusion is a modern GUI and command line application designed to make the life of anyone with proteome and/or alignment sequence data easier and more pleasurable. Regardless of your experience in bioinformatics, TriFusion is easy to use and offers a wide array of powerful features to help you deal with your data. At the same time, it was developed to handle the enormous amount of data that is generated nowadays.

Fast and efficient

User friendly

CLI version available

Open Source


Orthology

Search orthologs across proteomes

Searching for orthologs across multiple proteomes can be a daunting and complex task. Current pipelines are great, but they often require long and tedious setup and/or execution processes. TriFusion handles all of that for you, while providing a user-friendly interface that leverages the most popular framework for finding orthologs, OrthoMCL.

No fuss. No complications. Searching for orthologs in TriFusion is as simple as it gets.

Orthology search demo

Explore and filter your orthologs

So, you have completed your search for orthologs. Now what?

The Explore section of TriFusion's orthology module was designed to let you easily explore your ortholog search results. You can filter ortholog groups according to gene copy number and/or taxon representation. You can also check the effect that these filters have on the final number of ortholog groups using a suite of interactive plot screens.

Export ortholog groups into protein/nucleotide files

The final step of an ortholog search operation is usually the generation of sequence files for each potential ortholog group. TriFusion lets you apply multiple filters on your ortholog groups and export only the groups that you are interested in.

But that's not all. If you provide the CDS files corresponding to each proteome that you used in your search, you can also export those ortholog groups directly into nucleotide sequences. And don't worry if the headers of the sequences in the CDS and proteome files don't match. TriFusion searches for sequence equality, not header equality, so you can be sure that the sequences are always correctly converted.
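
To make the idea of matching by sequence rather than by header more concrete, here is a minimal Python sketch. It assumes that a CDS matches a protein when its translation (standard genetic code) equals the protein sequence; the source does not describe how TriFusion implements the comparison, and the function names and headers below are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence; unknown codons become 'X' and
    trailing stop codons are dropped."""
    cds = cds.upper()
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)
    )
    return protein.rstrip("*")


def match_cds_to_proteins(proteins, cds_records):
    """Map each protein header to the CDS whose translation equals the
    protein sequence (hypothetical helper; headers are never compared)."""
    by_translation = {translate(seq): seq for seq in cds_records.values()}
    return {
        header: by_translation.get(seq.rstrip("*"))
        for header, seq in proteins.items()
    }


# Headers differ completely between the two files, yet the match succeeds.
proteins = {"sp1|geneA": "MKV", "sp1|geneB": "MAL"}
cds_records = {"cds_0042": "ATGGCTCTGTAA", "cds_0007": "ATGAAAGTTTAG"}
matched = match_cds_to_proteins(proteins, cds_records)
```

Proteins without a matching CDS map to None in this sketch, which makes unconverted sequences easy to spot.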

Export orthologs demo

Process

Blazing fast alignment conversion/concatenation

Designed from the ground up to handle the massive number of alignments that are currently being processed in phylogenomic analyses, TriFusion is able to convert and concatenate thousands of alignment files from several input formats into one or more popular output formats (check the supported formats).

Easily load your data and get the desired converted/concatenated output in seconds.
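
As a rough illustration of what concatenation involves, the Python sketch below joins several alignments end to end. It assumes that taxa absent from an alignment are padded with a missing data symbol; that padding rule, the symbol "n" and the function name are assumptions, not a description of TriFusion's internals.

```python
def concatenate(alignments, missing="n"):
    """Concatenate alignments given as dicts of taxon -> sequence.

    Taxa that are absent from an alignment are padded with the missing
    data symbol so that every row keeps the same total length.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        length = len(next(iter(aln.values())))
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, missing * length))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


aln1 = {"TaxonA": "AAAA", "TaxonB": "AATA"}
aln2 = {"TaxonA": "GGG", "TaxonC": "GCG"}
combined = concatenate([aln1, aln2])
# combined["TaxonB"] is "AATAnnn": TaxonB is missing from the second alignment
```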

Concatenation/Conversion demo

High performance and efficiency

So you have a 1Gb alignment file? Or perhaps a set of 7,000 individual alignment files. Don't panic, TriFusion can handle it! With the inclusion of high performance libraries and several efficient programming techniques, even the largest data sets can be processed in seconds or a few minutes. Here's a sample of the run time and RAM consumption benchmarks for the main operations of TriFusion for different types of data sets.

                                  DS1              DS4                DS5                DS9
Files                             614              3093               7378               1
Taxa                              48               376                29                 376
Total bases                       ~350k            1.25M              ~601k              1.25M
Parsing (run time / RAM)
Alignment reading                 0.8s / 13.9Mb    14.5s / 169.5Mb    13.6s / 117.6Mb    2.9s / 14.5Mb
Main operations (run time / RAM)
Conversion (nexus)                0.5s / 0.3Mb     17.0s / 0.6Mb      93.7s / 1.0Mb      2.4s / 22.8Mb
Conversion (interleave)           2.4s / 0.4Mb     49.9s / 0.6Mb      110.1s / 1.0Mb     74.9s / 17.0Mb
Concatenation                     4.7s / 5.1Mb     554.4s / 21.6Mb    211.2s / 7.1Mb     NA

You can check the complete benchmarks table to see in detail how particular operations of TriFusion fare in terms of run time and memory usage across a wide range of datasets.

Collapse identical sequences

With the Collapse operation, you can collapse identical sequences into the same haplotype, creating an alignment that contains only unique sequences.

Here's a demonstration of how Collapse transforms an alignment:

Example:

# Four taxon alignment where there are only two unique sequences
>TaxonA
AAAAAA
>TaxonB
AAAAAA
>TaxonC
TTTTTT
>TaxonD
TTTTTT
              
# Collapsed alignment with only the two unique sequences
>Hap1
AAAAAA
>Hap2
TTTTTT
              

A .haplotypes file is generated alongside the collapsed alignment with the correspondence between haplotype names and the original taxon names.
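
The Collapse example above can be sketched in a few lines of Python. This is only an illustration of the idea: the function name is hypothetical, the "Hap" prefix follows the example, and the actual layout of the .haplotypes file is not reproduced here.

```python
def collapse(alignment, prefix="Hap"):
    """Collapse identical sequences into haplotypes.

    Returns the collapsed alignment and the correspondence between
    haplotype names and the original taxon names.
    """
    taxa_by_sequence = {}
    for taxon, seq in alignment.items():
        taxa_by_sequence.setdefault(seq, []).append(taxon)

    collapsed, correspondence = {}, {}
    for index, (seq, taxa) in enumerate(taxa_by_sequence.items(), start=1):
        name = f"{prefix}{index}"
        collapsed[name] = seq
        correspondence[name] = taxa
    return collapsed, correspondence


alignment = {
    "TaxonA": "AAAAAA",
    "TaxonB": "AAAAAA",
    "TaxonC": "TTTTTT",
    "TaxonD": "TTTTTT",
}
collapsed, correspondence = collapse(alignment)
# collapsed is {"Hap1": "AAAAAA", "Hap2": "TTTTTT"}
```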

Collapse alignments demo

Create consensus sequences from alignments

With the Consensus operation, you can create a consensus sequence for each alignment individually or for the concatenated alignment. Several methods are available for handling sequence variation.

Options for sequence variation handling during consensus

  • IUPAC: (Nucleotide sequences only) Converts variable sites using the IUPAC nucleotide ambiguity code
  • Soft mask: Replaces variable sites with the missing data symbol
  • Remove: Removes variable sites from the consensus
  • First sequence: An arbitrary method that simply uses the first sequence without modification

Optionally, consensus sequences from multiple alignments can be merged into a single output file, ready for downstream analyses. Here's an example of how consensus sequences are created for three input alignments.

Example:

# Alignment1.fas
>TaxonA
ATAAAA
>TaxonB
AAAAAA
>TaxonC
AAAAAG
>TaxonD
AAAAAA
                
# Alignment2.fas
>TaxonA
TATTTT
>TaxonB
TTTTTC
>TaxonC
TTTTTT
>TaxonD
TTTTTT
                
# Alignment3.fas
>TaxonA
GAGGGG
>TaxonB
GGGGGG
>TaxonC
GGGGGG
>TaxonD
GGGGGG
                

Create individual consensus sequences, masking sequence variation

# Alignment1_consensus.fas
>consensus
ANAAAN
                
# Alignment2_consensus.fas
>consensus
TNTTTN
                
# Alignment3_consensus.fas
>consensus
GNGGGG
                

Optionally, merge all consensus sequences into a single file

# consensus.fas
>Alignment1_consensus
ANAAAN
>Alignment2_consensus
TNTTTN
>Alignment3_consensus
GNGGGG
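
For readers who prefer code, the four ways of handling variable sites can be sketched in Python as follows. The method labels mirror the options listed earlier, but the function itself is a hypothetical illustration; in particular, it treats every character (including gaps) as an ordinary state, which may differ from what TriFusion does.

```python
# IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def consensus(sequences, method="soft mask", missing="N"):
    """Build a consensus from equal-length sequences (illustrative only)."""
    if method == "first sequence":
        return sequences[0]
    result = []
    for column in zip(*sequences):
        states = frozenset(column)
        if len(states) == 1:
            result.append(column[0])        # invariant site, keep as is
        elif method == "soft mask":
            result.append(missing)          # mask the variable site
        elif method == "IUPAC":
            result.append(IUPAC.get(states, missing))
        elif method == "remove":
            continue                        # drop the variable site
        else:
            raise ValueError(f"unknown method: {method}")
    return "".join(result)


alignment1 = ["ATAAAA", "AAAAAA", "AAAAAG", "AAAAAA"]
masked = consensus(alignment1)              # "ANAAAN", as in the example
ambiguity = consensus(alignment1, "IUPAC")  # "AWAAAR"
```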
                

Create consensus demo

Filter alignments by taxa

Alignments can be filtered depending on whether they contain or exclude user-defined taxa groups. These taxa groups can be easily defined in TriFusion or imported from text files and can represent any number of samples in the data set. For instance, in the example below, we filter out alignments that do not contain both TaxonA and TaxonD, resulting in only 2 out of 4 alignments being further processed.

Example:

# Alignment1.fas
>TaxonA
AAAAAA
>TaxonB
TTTTTT
>TaxonC
CCCCCC
>TaxonD
DDDDDD
                  
# Alignment2.fas
>TaxonA
AAAAAA
>TaxonB
TTTTTT
                
# Alignment3.fas
>TaxonB
TTTTTT
>TaxonC
CCCCCC
>TaxonD
DDDDDD
                
# Alignment4.fas
>TaxonA
AAAAAA
>TaxonD
DDDDDD
                  

Preserving only alignments that contain TaxonA and TaxonD

# Alignment1.fas
>TaxonA
AAAAAA
>TaxonB
TTTTTT
>TaxonC
CCCCCC
>TaxonD
DDDDDD
                  
# Alignment4.fas
>TaxonA
AAAAAA
>TaxonD
DDDDDD
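
The taxa filter above boils down to a set comparison, as this small Python sketch shows. It assumes that "contain" means the alignment holds every taxon of the group and that "exclude" means it holds none of them; the function name and mode labels are hypothetical.

```python
def filter_by_taxa(alignments, taxa_group, mode="contain"):
    """Keep alignments according to a taxa group.

    mode="contain": keep alignments that include every taxon in the group.
    mode="exclude": keep alignments that include none of them.
    """
    group = set(taxa_group)
    kept = {}
    for name, alignment in alignments.items():
        if mode == "contain":
            keep = group <= set(alignment)
        elif mode == "exclude":
            keep = group.isdisjoint(alignment)
        else:
            raise ValueError(f"unknown mode: {mode}")
        if keep:
            kept[name] = alignment
    return kept


alignments = {
    "Alignment1.fas": {"TaxonA": "AAAAAA", "TaxonB": "TTTTTT",
                       "TaxonC": "CCCCCC", "TaxonD": "DDDDDD"},
    "Alignment2.fas": {"TaxonA": "AAAAAA", "TaxonB": "TTTTTT"},
    "Alignment3.fas": {"TaxonB": "TTTTTT", "TaxonC": "CCCCCC",
                       "TaxonD": "DDDDDD"},
    "Alignment4.fas": {"TaxonA": "AAAAAA", "TaxonD": "DDDDDD"},
}
kept = filter_by_taxa(alignments, ["TaxonA", "TaxonD"])
# Only Alignment1.fas and Alignment4.fas remain, as in the example
```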
                  

Filter by taxa demo

Filter alignments by codons

Alignment columns can be filtered according to their codon positions in any combination. This is quite useful when you wish to remove, for example, saturated codon positions for downstream analyses. However, note that, for now, TriFusion will filter the same codon positions for each input alignment.
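
In code, a codon position filter is just a repeating mask of three switches applied along the alignment. The Python sketch below assumes that the alignment starts at the first codon position; the function name and the boolean triple are hypothetical.

```python
def filter_codon_positions(sequence, keep=(True, True, False)):
    """Keep only the selected codon positions of an in-frame sequence.

    `keep` holds one switch per codon position (1st, 2nd, 3rd).
    """
    return "".join(
        base for index, base in enumerate(sequence) if keep[index % 3]
    )


# Drop the third codon position, which is often the most saturated one.
first_and_second = filter_codon_positions("ATGGCTCTG")
# Keep only the third codon position.
third_only = filter_codon_positions("ATGGCTCTG", keep=(False, False, True))
```

Because the same mask is applied to every sequence, the filtered alignment stays rectangular.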

Filter by codons demo

Filter alignments by missing data/genes

Filtering alignments and alignment columns according to missing taxa and missing data, respectively, is one of the most common processing steps when dealing with large amounts of data.

Filter missing data within alignments

1. Replace gaps at the extremities

The first step when filtering alignment columns for missing data is to replace the gap symbols at the extremities of the alignment with missing data symbols. Gaps at the extremities of the alignment are usually introduced by alignment programs, but they actually represent missing data. The distinction may be important for several downstream analyses, for filtering according to gaps/missing data or for coding gaps at the end of the alignment matrix.

>TaxonA
--AA-AA--A-
>TaxonB
-AAAA--AAAA
>TaxonC
AAA--AAAA--
>TaxonD
AAA--AA-AAA
                  
>TaxonA
NNAA-AA--AN
>TaxonB
NAAAA--AAAA
>TaxonC
AAA--AAAANN
>TaxonD
AAA--AA-AAA
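
The replacement shown above can be expressed compactly in Python. This is a minimal sketch with a hypothetical function name; it uses "-" as the gap symbol and "N" as the missing data symbol, as in the example.

```python
def replace_terminal_gaps(sequence, gap="-", missing="N"):
    """Replace leading and trailing gaps with the missing data symbol,
    leaving internal gaps untouched."""
    core = sequence.strip(gap)
    if not core:
        # The whole sequence is gaps: treat all of it as missing data.
        return missing * len(sequence)
    leading = len(sequence) - len(sequence.lstrip(gap))
    trailing = len(sequence) - len(sequence.rstrip(gap))
    return missing * leading + core + missing * trailing


alignment = {
    "TaxonA": "--AA-AA--A-",
    "TaxonB": "-AAAA--AAAA",
    "TaxonC": "AAA--AAAA--",
    "TaxonD": "AAA--AA-AAA",
}
cleaned = {taxon: replace_terminal_gaps(seq) for taxon, seq in alignment.items()}
# cleaned["TaxonA"] is "NNAA-AA--AN"
```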
                  

2. Filter columns with excessive gaps/missing data

You may want to filter your alignments in order to reduce the number of columns with excessive missing data, or simply remove all missing data from the alignments. TriFusion allows you to set two independent thresholds for gaps and missing data and then removes all alignment columns that exceed those thresholds.

# Filter columns with max gap 50%
# and max missing data 50%
>TaxonA
NNAA-AA--AN
>TaxonB
NAAAA--AAAA
>TaxonC
AAA--AAAANN
>TaxonD
AAA--AA-AAA
                  
# Two columns filtered
>TaxonA
NAAAA--AN
>TaxonB
AAA--AAAA
>TaxonC
AA-AAAANN
>TaxonD
AA-AA-AAA
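
Here is a rough Python sketch of threshold-based column filtering, run on a separate toy alignment. It assumes that the two thresholds are evaluated independently, that a column is removed only when a proportion strictly exceeds its threshold, and that gaps do not count toward missing data. How TriFusion counts and compares these proportions is not specified in this document, so treat every one of those choices, and the function name, as assumptions.

```python
def filter_columns(alignment, max_gap=0.5, max_missing=0.5,
                   gap="-", missing="N"):
    """Remove columns whose proportion of gaps or of missing data
    exceeds the corresponding threshold."""
    taxa = list(alignment)
    kept_columns = []
    for column in zip(*(alignment[taxon] for taxon in taxa)):
        gap_fraction = column.count(gap) / len(taxa)
        missing_fraction = column.count(missing) / len(taxa)
        if gap_fraction <= max_gap and missing_fraction <= max_missing:
            kept_columns.append(column)
    return {
        taxon: "".join(column[index] for column in kept_columns)
        for index, taxon in enumerate(taxa)
    }


toy = {
    "TaxonA": "A-AN",
    "TaxonB": "A-AN",
    "TaxonC": "A-AN",
    "TaxonD": "AAAA",
}
filtered = filter_columns(toy)
# Column 2 (75% gaps) and column 4 (75% missing data) are removed
```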
                  

Filter alignments with low taxa coverage

When you have multiple files, you may want to filter alignments that have low taxa coverage in order to produce denser alignment matrices. This can be easily accomplished by setting a minimum taxa representation proportion that alignments will have to pass to be further processed. This can be used in combination with the taxa filter option for more fine-grained control over the resulting alignment matrices.

  # 6 taxa file
  >TaxonA
  AAAAAA
  >TaxonB
  AAAAAA
  >TaxonC
  AAAAAA
  >TaxonD
  AAAAAA
  >TaxonE
  AAAAAA
  >TaxonF
  AAAAAA
                  
  # 3 taxa file
  >TaxonA
  AAAAAA
  >TaxonD
  AAAAAA
  >TaxonE
  AAAAAA
                  
  # 2 taxa file
  >TaxonA
  AAAAAA
  >TaxonB
  AAAAAA
                  
  # 5 taxa file
  >TaxonA
  AAAAAA
  >TaxonB
  AAAAAA
  >TaxonC
  AAAAAA
  >TaxonD
  AAAAAA
  >TaxonE
  AAAAAA
                  

Minimum taxa representation of 50%

  # 6 taxa file
  >TaxonA
  AAAAAA
  >TaxonB
  AAAAAA
  >TaxonC
  AAAAAA
  >TaxonD
  AAAAAA
  >TaxonE
  AAAAAA
  >TaxonF
  AAAAAA
                  
  # 5 taxa file
  >TaxonA
  AAAAAA
  >TaxonB
  AAAAAA
  >TaxonC
  AAAAAA
  >TaxonD
  AAAAAA
  >TaxonE
  AAAAAA
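
A minimum taxa representation filter can be sketched in Python as shown below, using a separate toy data set. The representation is computed here as the number of taxa in an alignment divided by the number of distinct taxa across all alignments, and an alignment sitting exactly at the threshold is kept. Both choices, as well as the function name, are assumptions rather than documented TriFusion behaviour.

```python
def filter_min_taxa(alignments, min_representation=0.5):
    """Keep alignments whose share of the data set's taxa is at least
    `min_representation` (a proportion between 0 and 1)."""
    all_taxa = {taxon for aln in alignments.values() for taxon in aln}
    return {
        name: aln
        for name, aln in alignments.items()
        if len(aln) / len(all_taxa) >= min_representation
    }


alignments = {
    "four_taxa.fas": {"TaxonA": "AAAAAA", "TaxonB": "AAAAAA",
                      "TaxonC": "AAAAAA", "TaxonD": "AAAAAA"},
    "three_taxa.fas": {"TaxonA": "AAAAAA", "TaxonB": "AAAAAA",
                       "TaxonC": "AAAAAA"},
    "one_taxon.fas": {"TaxonA": "AAAAAA"},
}
dense = filter_min_taxa(alignments, min_representation=0.5)
# one_taxon.fas covers only 25% of the taxa and is dropped
```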
                  

Filter by missing data demo

Filter alignments by sequence variation

Alignments can be filtered according to the number of variable and/or parsimony informative columns. You can specify a range of acceptable values for either metric and keep only the alignments with the type of variation that you are interested in.

Example:

# 4 variable sites
# Alignment1.fas
>TaxonA
AAAAAA
>TaxonB
ATAATA
>TaxonC
TAAAAA
>TaxonD
AAAAAT
                  
# 1 variable site
# Alignment2.fas
>TaxonA
AAAAAA
>TaxonB
AAAATA
>TaxonC
AAAAAA
>TaxonD
AAAAAA
                  
# 0 variable sites
# Alignment3.fas
>TaxonA
AAAAAA
>TaxonB
AAAAAA
>TaxonC
AAAAAA
>TaxonD
AAAAAA
                  
# 3 variable sites
# Alignment4.fas
>TaxonA
ATAATA
>TaxonB
AAAAAA
>TaxonC
AAAAAA
>TaxonD
AAAAAT
                  

Filter out alignments with fewer than 4 variable sites

# 4 variable sites
>TaxonA
AAAAAA
>TaxonB
ATAATA
>TaxonC
TAAAAA
>TaxonD
AAAAAT
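
Both metrics are easy to compute column by column. The Python sketch below uses the usual definitions: a site is variable when it shows more than one state, and parsimony informative when at least two states each occur in at least two sequences. Ignoring gaps and missing data symbols is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def variation_counts(sequences, ignore="-N?"):
    """Return (variable sites, parsimony informative sites)."""
    variable = informative = 0
    for column in zip(*sequences):
        counts = Counter(char for char in column if char not in ignore)
        if len(counts) > 1:
            variable += 1
            # Informative: at least two states seen in two or more sequences.
            if sum(1 for n in counts.values() if n >= 2) >= 2:
                informative += 1
    return variable, informative


alignment1 = ["AAAAAA", "ATAATA", "TAAAAA", "AAAAAT"]
alignment4 = ["ATAATA", "AAAAAA", "AAAAAA", "AAAAAT"]
counts1 = variation_counts(alignment1)   # 4 variable sites, none informative
counts4 = variation_counts(alignment4)   # 3 variable sites, none informative
```

Keeping only the alignments whose counts fall inside a chosen range then reproduces the filter described above.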
                  

Code gaps into the alignment matrix

If you suspect that the indel patterns in your alignments may contain valuable phylogenetic information, TriFusion has the option to code your indel patterns as binary states at the end of the alignment matrix and create a well-formatted Nexus file ready for downstream analysis with programs such as MrBayes. If you are concatenating multiple alignments, the binary matrix will be concatenated as well.
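
To give a feel for what coding indels as binary states means, here is a deliberately simplified Python sketch. Each distinct gap (defined by its start and end columns) becomes one binary character, scored 1 for sequences that carry exactly that gap and 0 otherwise. This presence/absence scheme and the function name are assumptions for illustration; the exact coding rules used by TriFusion and the Nexus output are not reproduced here.

```python
import re


def code_gaps(alignment, gap="-"):
    """Append one binary character per distinct gap to every sequence."""
    pattern = re.compile(re.escape(gap) + "+")
    # For each taxon, the set of (start, end) spans of its gap runs.
    indels = {
        taxon: {match.span() for match in pattern.finditer(seq)}
        for taxon, seq in alignment.items()
    }
    characters = sorted(set().union(*indels.values()))
    return {
        taxon: seq + "".join(
            "1" if span in indels[taxon] else "0" for span in characters
        )
        for taxon, seq in alignment.items()
    }


alignment = {"TaxonA": "AA--AA", "TaxonB": "AAAAAA", "TaxonC": "AA--A-"}
coded = code_gaps(alignment)
# Two distinct gaps are found, so two binary columns are appended
```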

Gap coding demo

Reverse a concatenated alignment

If you have a concatenated alignment and a partitions file, you can easily export those partitions as individual alignment files. More generally, TriFusion allows you to export subsets of any given alignment that are defined in a provided partitions file or defined within TriFusion, including codon partitions.

Example:

If you have a 1,000bp alignment and wish to export 4 partitions of equal length as individual alignment files, you just need to write a partitions file like this (alternatively, you could define those partitions inside TriFusion; see this tutorial):

# Export these 4 partitions
charset alignment1 = 1:250;
charset alignment2 = 251:500;
charset alignment3 = 501:750;
charset alignment4 = 751:1000;

The export of codon partitions is also possible using the /3 notation. For instance, to split the first partition in the example above into three codon partitions you could do the following:

# Export these 6 partitions
charset alignment1_1 = 1:250/3;
charset alignment1_2 = 2:250/3;
charset alignment1_3 = 3:250/3;
charset alignment2 = 251:500;
charset alignment3 = 501:750;
charset alignment4 = 751:1000;
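
The partition notation above maps naturally onto Python slicing. The sketch below parses charset lines of the form shown in the examples and extracts the corresponding columns. It assumes that positions are 1-based and inclusive and that /3 means every third column starting at the first position; the regular expression and function names are hypothetical and only cover the syntax shown here.

```python
import re

# Matches lines such as "charset alignment1_2 = 2:250/3;"
CHARSET = re.compile(r"charset\s+(\S+)\s*=\s*(\d+):(\d+)(?:/(\d+))?\s*;")


def parse_partitions(text):
    """Return {name: (start, end, step)} for each charset definition."""
    return {
        name: (int(start), int(end), int(step or 1))
        for name, start, end, step in CHARSET.findall(text)
    }


def extract_partition(alignment, start, end, step=1):
    """Slice the 1-based, inclusive range start:end out of every sequence."""
    return {taxon: seq[start - 1:end:step] for taxon, seq in alignment.items()}


partitions = parse_partitions("""
charset gene1 = 1:6;
charset gene1_3 = 3:6/3;
""")
alignment = {"TaxonA": "ATGGCTAAA", "TaxonB": "ATGGCAAAG"}
gene1 = extract_partition(alignment, *partitions["gene1"])
third_positions = extract_partition(alignment, *partitions["gene1_3"])
```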

Reverse concatenation demo

Combine multiple operations on a single run

All operations of TriFusion's Process can be chained in any way during a single run. For instance, you may want to concatenate your alignments, filter alignments that exclude a given taxa group, remove alignment columns with high missing data and collapse sequences into unique haplotypes. All in the same execution.

In addition, the output of any individual operation can be saved in a new file that is independent from the main output file. For instance, you can select the concatenation and collapse of your alignments, but save the output of the collapse operation in a second file.

For performance reasons, all these operations follow a specific order:

  1. Main operation (Concatenation/Conversion/Reverse concatenation)
  2. Filter
    1. Taxa filter
    2. Minimum taxa representation
    3. Codon filter
    4. Missing data filter
    5. Sequence variation filter
  3. Collapse
  4. Gap coding
  5. Consensus

The video below demonstrates a single run where 141 alignments are concatenated, collapsed, filtered by taxa group, filtered by minimum taxa representation, filtered by missing data and filtered by sequence variation, and the consensus sequence of the concatenated alignment is also generated in an additional output file.

Setup and usage of data set groups

Sometimes you may want to perform different tasks on different sets of taxa or alignments.

For instance, you may have a data set of nuclear and mitochondrial alignments and wish to create a concatenated alignment for each set of alignments.

Or perhaps your data set includes several taxonomic groups of interest and you want to create multiple concatenated data matrices that maximize the data for those particular groups of taxa.

To make your life easier when these needs arise, TriFusion allows you to create or import groups of taxa and/or alignments and then rapidly switch between them. The video below demonstrates how groups of files and taxa can be quickly created and then how you can perform different tasks on them.

Data set groups demo

Save frequently used data sets as projects

If you have one or more data sets that you need to process or analyze frequently, TriFusion offers the possibility of saving any set of files currently loaded into the application as a project, so that they can be quickly loaded in future sessions. This can be useful in the same session as well, as a way of quickly switching between different sets of files.

Set partitions and substitution models

Partitions can be easily defined and changed within TriFusion, imported from partitions files, or specified within the alignment file in the Nexus format. You can also set and modify the substitution model for each partition, which can be useful for several downstream analyses. In the case of nucleotide sequences, codon partitions can also be defined in any possible combination and with independent substitution models.

Partitions and models demo

Statistics

Quick summary statistics

As soon as you load your alignment data and enter the Statistics screen, the calculation of summary statistics will start in the background. When completed, you will be able to see several statistics related to general, missing data and sequence variation information averaged across the data set. In addition, you can also see these statistics detailed for each input alignment file in a tabular format with sorting and search functions. Both of these formats can be exported into a .csv file that can be opened in any spreadsheet software.

Summary statistics demo

Plot categories

One of the main challenges of dealing with very large sets of alignment files is the difficulty in getting a good feel for the data, that is, understanding its characteristics and peculiarities.

How much missing data do I have? And sequence variation? How are these metrics distributed across taxa and alignments? Are there any outlier taxa/alignments?

Answering these questions is the main purpose of the Statistics module. There are dozens of plotting options available for you to quickly explore and visualize many aspects of your data. You can check the plot gallery to see some examples.

Plotting options

  • General information
    • Distribution of sequence size
    • Proportion of nucleotides or residues
    • Distribution of taxa frequency
  • Polymorphism and variation
    • Pairwise sequence similarity
    • Segregating sites
    • Alignment length/polymorphism correlation
    • Allele frequency spectrum
  • Missing data
    • Gene occupancy
    • Distribution of missing taxa
    • Distribution of missing data
    • Cumulative distribution of missing genes
  • Outlier detection
    • Missing data outliers
    • Segregating sites outliers
    • Sequence size outliers

Plot types

For each of these plotting options, you may have up to three plot types that provide a different focus on the same analysis:

  • Single gene: Plots the information along the alignment length (Sliding window).
  • Per species: Plots the information with the taxa along the x-axis or in triangular matrices.
  • Average: Plots information averaged across the active data set.

Statistics categories demo

Data visualization focusing on single alignments

Some plotting analyses can be performed on individual alignments and generate a sliding window plot with the variation of a given metric along the length of the alignment. For the moment, these plot types are only provided for some of the Polymorphism and variation plot options.

Here's an example of the sliding window analyses for individual alignments.

Single gene plots demo

Data visualization focusing on taxa

If you are interested in evaluating the impact that some metric has on individual taxa, use the Taxa plot type, when available. This focuses the analysis on taxa by discriminating them along the x-axis, in a triangular matrix or along a distribution.

Here's an example with the generation of some taxa focused plots.

Taxa focused plots demo

Data visualization focusing on alignments average

If you are interested in data set-wide trends or distributions, use the Average plot type, when available. These plot types will usually generate the averaged distribution of the metric of interest across the active data set.

Here's an example with the generation of some data set wide distributions.

Average plots demo

Fast plot switching

While exploring your data, it is common to switch back and forth among several plots. Since some plots may take a while to be generated, depending on the size of the active data set, several techniques were implemented to allow fast switching among plots that have already been generated. In addition, some computations are stored locally in a SQLite database in order to reduce waiting times when changing the active data set.
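
The general idea of storing computations in SQLite can be illustrated with Python's built-in sqlite3 module. The table layout, key names and function below are purely hypothetical and say nothing about the schema TriFusion actually uses; the sketch only shows how a result can be computed once and then read back from the database.

```python
import json
import sqlite3


def cached_statistic(conn, dataset_key, name, compute):
    """Return a stored result if present; otherwise compute and store it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stats ("
        "dataset TEXT, name TEXT, value TEXT, PRIMARY KEY (dataset, name))"
    )
    row = conn.execute(
        "SELECT value FROM stats WHERE dataset = ? AND name = ?",
        (dataset_key, name),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    value = compute()
    conn.execute(
        "INSERT INTO stats VALUES (?, ?, ?)",
        (dataset_key, name, json.dumps(value)),
    )
    conn.commit()
    return value


conn = sqlite3.connect(":memory:")  # a file path would persist across sessions
calls = []


def expensive_summary():
    calls.append(1)  # track how often the real computation runs
    return {"mean_length": 512.5}


first = cached_statistic(conn, "ds1", "summary", expensive_summary)
second = cached_statistic(conn, "ds1", "summary", expensive_summary)
# The second call is served from the database; the computation ran once
```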

Fast plot switching demo

Usage of data set groups

Just like in the Process section, you can create or import groups of taxa/files and switch between them to generate the plots. Simply create your groups of interest and then select the ones you want for any plotting option.

Data set groups demo

Export plots as figures or tables

All plots generated in the Statistics module can be exported as a figure (in several graphical formats, including vector formats). Generally, the plot generators were designed to produce high quality and publication ready graphics. However, you may want to tweak, change or add content to these plots. To make life easier for you, you can also export the data used to generate the plot as a .csv file, which can be easily loaded by other graphical libraries and software.

Export plots demo