trifusion.process.sequence module¶

The sequence module of TriFusion contains the main classes handling alignment sequence data. These are Alignment and AlignmentList. Here follows a brief explanation of how these classes work and how to deal with the sqlite database.

Alignment class¶

The Alignment class is the main interface for single alignment files. It can be viewed as the building block of an AlignmentList object, which can hold one or more Alignment objects. It contains all methods and attributes that pertain to a given alignment and that are used to retrieve or modify its information. However, it is NOT meant to be used independently, but rather within the context of an AlignmentList object. The data from each alignment is stored in a single sqlite database during the execution of TriFusion or the TriSeq/TriStats CLI programs. The connection to this database is automatically handled by AlignmentList for all Alignment objects included in it. In this way, the AlignmentList class handles the setup of the sqlite3 database, and the Alignment class can focus on handling the data of a single alignment.

The main types of methods defined in this class are:

Parsers¶

Parsing methods are defined for each format with _read_<format>:

_read_phylip(): Parses phylip format.

_read_fasta(): Parses fasta format.

_read_nexus(): Parses nexus format.

_read_loci(): Parses pyRAD/ipyrad loci format.

_read_stockholm(): Parses stockholm format.

They are always called from the read_alignment() method, and not directly. When an Alignment object is instantiated with a path to an alignment file, it automatically detects the format of the alignment and keeps that information on the input_format attribute. The read_alignment() method then calls the parsing method corresponding to that format. That information is stored in a dictionary:

parsing_methods = {
    "phylip": self._read_phylip,
    "fasta": self._read_fasta,
    "loci": self._read_loci,
    "nexus": self._read_nexus,
    "stockholm": self._read_stockholm
}

# Call the appropriate method
parsing_methods[self.input_format]()

Each format has its own parsing method (which can be modified directly). To add a new format, it is necessary to add it to the automatic format recognition in autofinder(). Then, create the new parsing method, using the same _read_<format> notation and add it to the parsing_methods dictionary in read_alignment().
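The dispatch pattern and the steps for adding a format can be sketched in isolation. The example below is a minimal, self-contained illustration and not TriFusion code: ToyAlignment, its records list (a stand-in for the sqlite table) and the "tsv" format are all hypothetical, and format detection via autofinder() is skipped by passing the format explicitly.

```python
class ToyAlignment:
    """Toy stand-in for Alignment; only the parser dispatch is illustrated."""

    def __init__(self, lines, input_format):
        self.lines = lines
        self.input_format = input_format
        self.records = []  # stand-in for the sqlite table

    def _read_fasta(self):
        name, chunks = None, []
        for line in self.lines:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record
                if name is not None:
                    self.records.append((name, "".join(chunks)))
                name, chunks = line[1:], []
            elif line:
                chunks.append(line)
        if name is not None:
            self.records.append((name, "".join(chunks)))

    def _read_tsv(self):
        # Hypothetical new format: one "<taxon>\t<sequence>" pair per line
        for line in self.lines:
            if line.strip():
                taxon, seq = line.rstrip("\n").split("\t")
                self.records.append((taxon, seq))

    def read_alignment(self):
        parsing_methods = {
            "fasta": self._read_fasta,
            "tsv": self._read_tsv,  # the new parser is registered here
        }
        # Call the appropriate method
        parsing_methods[self.input_format]()


aln = ToyAlignment(["spa\tAAA\n", "spb\tAAT\n"], "tsv")
aln.read_alignment()
```

Keeping the dictionary inside read_alignment() means a new parser only has to be registered in one place.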

New parsers must insert alignment data into a table in the sqlite database. This table is automatically created when the Alignment object is instantiated, and its name is stored in the table_name attribute (see _create_table()):

cur.execute("CREATE TABLE [{}]("
            "txId INT,"
            "taxon TEXT,"
            "seq TEXT)".format(table_name))

To insert data into the database, a taxon id (txId), taxon name (taxon) and sequence (seq) must be provided. For example:

cur.execute("INSERT INTO [{}] VALUES (?, ?, ?)".format(self.table_name),
            (0, "spa", "AAA"))
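For experimenting outside TriFusion, the two statements above can be combined into a small script. This is only a sketch: it uses an in-memory database and a table name ("main") chosen for the example, whereas in TriFusion the connection and table name come from AlignmentList and the table_name attribute.

```python
import sqlite3

# In-memory database used only for this example
con = sqlite3.connect(":memory:")
cur = con.cursor()
table_name = "main"  # example value, normally self.table_name

# Same schema as the one created by _create_table()
cur.execute("CREATE TABLE [{}]("
            "txId INT,"
            "taxon TEXT,"
            "seq TEXT)".format(table_name))

# One row per taxon: (txId, taxon, seq)
rows = [(0, "spa", "AAA"), (1, "spb", "AAT")]
cur.executemany("INSERT INTO [{}] VALUES (?, ?, ?)".format(table_name), rows)
con.commit()

# Read the data back, ordered by taxon id
fetched = cur.execute(
    "SELECT taxon, seq FROM [{}] ORDER BY txId".format(table_name)).fetchall()
```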

Data fetching¶

To facilitate fetching alignment data from the database, several generators and data retrieval methods are defined:

iter_columns(): Iterates over the columns of the alignment.

iter_columns_uniq(): Iterates over the columns of the alignment but yields only unique characters.

iter_sequences(): Iterates over each sequence in the alignment.

iter_alignment(): Iterates over both taxon name and sequence in the alignment.

get_sequence(): Return the sequence from a particular taxon.

These should always use the setup_intable() decorator and be defined with the table_suffix, table_name and table arguments. For instance, the Alignment.iter_sequences() method is a generator that allows iteration over the sequences in the alignment and is defined as:

@setup_intable
def iter_sequences(self, table_suffix="", table_name=None, table=None):

When calling these methods, only the table_suffix and table_name have to be provided. In fact, the value provided to the table argument at calling time is ignored. The decorator will check the values of both table_suffix and table_name and evaluate the database table that will be used. This final table name will then be provided as the table argument value within the decorator. In this way, these methods can be called like:

for seq in self.iter_sequences(table_suffix="_collapse"):
    # Do stuff

In this case, the setup_intable() decorator will append the table_suffix to the name of the original alignment table. If Alignment.table_name="main", then the final table name in this case will be “main_collapse”.

Alternatively, we can use table_name:

for seq in self.iter_sequences(table_name="main_2"):
    # Do stuff

In this case, the final table name will be “main_2”.

If the final table name does not exist, the method falls back to the original table name defined by the table_name attribute.
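The table resolution described above can be approximated with a short decorator. The sketch below is a hypothetical re-implementation for illustration only, not the actual setup_intable() code: the name setup_intable_sketch, the ToyAlignment class and the choice to let table_name take precedence over table_suffix when both are given are assumptions.

```python
import sqlite3
from functools import wraps


def setup_intable_sketch(func):
    """Hypothetical sketch of the table resolution logic."""
    @wraps(func)
    def wrapper(self, *args, table_suffix="", table_name=None, table=None,
                **kwargs):
        # The caller's `table` value is ignored on purpose.
        # Assumption: table_name wins over table_suffix if both are given.
        candidate = table_name if table_name else self.table_name + table_suffix
        found = self.cur.execute(
            "SELECT name FROM sqlite_master WHERE type='table' AND name=?",
            (candidate,)).fetchone()
        # Fall back to the original table when the candidate does not exist
        final = candidate if found else self.table_name
        return func(self, *args, table_suffix=table_suffix,
                    table_name=table_name, table=final, **kwargs)
    return wrapper


class ToyAlignment:
    """Toy class with a 'main' table and a 'main_collapse' table."""

    def __init__(self):
        self.con = sqlite3.connect(":memory:")
        self.cur = self.con.cursor()
        self.table_name = "main"
        data = (("main", ["AAA", "AAA", "AAT"]),
                ("main_collapse", ["AAA", "AAT"]))
        for name, seqs in data:
            self.cur.execute(
                "CREATE TABLE [{}](txId INT, taxon TEXT, seq TEXT)".format(name))
            self.cur.executemany(
                "INSERT INTO [{}] VALUES (?, ?, ?)".format(name),
                [(i, "sp{}".format(i), s) for i, s in enumerate(seqs)])

    @setup_intable_sketch
    def iter_sequences(self, table_suffix="", table_name=None, table=None):
        # A separate cursor keeps the iteration independent of self.cur
        query = "SELECT seq FROM [{}] ORDER BY txId".format(table)
        for (seq,) in self.con.execute(query):
            yield seq


aln = ToyAlignment()
original = list(aln.iter_sequences())
collapsed = list(aln.iter_sequences(table_suffix="_collapse"))
fallback = list(aln.iter_sequences(table_name="does_not_exist"))
```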

Alignment modifiers¶

Methods that perform modifications to the alignment are also defined here. These include:

collapse(): Collapses alignment into unique sequences.

consensus(): Merges alignment into a single sequence.

filter_codon_positions(): Filters alignment columns according to codon position.

filter_missing_data(): Filters columns according to missing data.

code_gaps(): Codes the alignment’s indel patterns as a binary matrix.

For example, the Alignment.collapse() method transforms the original alignment into a new one that contains only unique sequences. An important factor to take into account with alignment-modifying methods is that the original alignment data may need to be preserved for future operations. In TriFusion, the original alignment must be available at all times, since users may perform any number of process executions in a single session. Therefore, all methods that can potentially modify the original alignment need to be decorated with the setup_database() function, and must be defined with at least the table_in and table_out arguments. The decorator and these arguments work together to determine the database table from which the data will be fetched, and the table to which the modified alignment will be written. For instance, the Alignment.collapse() method is defined as:

@setup_database
def collapse(..., table_in=None, table_out="collapsed",
             use_main_table=False):

If we want to perform a collapse of the original alignment and store the modified alignment in a new table, we could call Alignment.collapse() like:

collapse(table_out="new_table")

The setup_database() decorator interprets table_in=None as an instruction to use the table with the original alignment, and stores the modified alignment in “new_table”.

However, we may want to perform a collapse operation after a previous modification by another method. In that case, we can specify a table_in:

collapse(table_in="other_table", table_out="collapse")

One issue with this approach is that we do not know a priori which operations will be requested by the user nor the order. If one execution performs, say, a consensus and a collapse, the new table should be created in consensus and then used as input in collapse. However, if only collapse is called, then the new table should only be created there. To solve this issue, the setup_database decorator is smart about its arguments. We can create a sequence of operations with the same table_in and table_out arguments:

new_table = "master_table"
if "consensus" in operations:
    consensus(table_in=new_table, table_out=new_table)
if "collapse" in operations:
    collapse(table_in=new_table, table_out=new_table)

In this simple pipeline, the user may perform either a consensus, a collapse, or both. When the first method is called, the setup_database() decorator will check if the table provided in table_in exists. In the first called method it will not exist, so instead of returning an error, it falls back to the original alignment table and then writes the modified alignment to “master_table”. In the second method, table_in already exists, so it fetches alignment data from the “master_table”. This will work whether these methods are called individually or in combination.

When there is no need to keep the original alignment data (in single execution of TriSeq, for instance), the special use_main_table argument can be provided to tell the method to use the original table as the input and output table. If this argument is True, it supersedes any information provided by table_in or table_out:

collapse(use_main_table=True)
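The fallback behavior of table_in and the effect of use_main_table can be reproduced with a small stand-alone example. Everything below is a hypothetical sketch rather than the real decorator: setup_database_sketch, ToyAlignment and the to_upper() modifier are invented for the illustration, and the toy methods load all rows into memory, which real modifiers should avoid.

```python
import sqlite3
from functools import wraps


def table_exists(cur, name):
    query = "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?"
    return cur.execute(query, (name,)).fetchone() is not None


def setup_database_sketch(func):
    """Hypothetical sketch of the table_in/table_out resolution."""
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        if kwargs.pop("use_main_table", False):
            # Supersedes table_in and table_out
            kwargs["table_in"] = kwargs["table_out"] = self.table_name
        else:
            table_in = kwargs.get("table_in")
            # Missing or non-existent input table: use the original one
            if table_in is None or not table_exists(self.cur, table_in):
                kwargs["table_in"] = self.table_name
        return func(self, *args, **kwargs)
    return wrapper


class ToyAlignment:
    def __init__(self, rows):
        self.con = sqlite3.connect(":memory:")
        self.cur = self.con.cursor()
        self.table_name = "main"
        self._write("main", rows)

    def _read(self, table):
        query = "SELECT txId, taxon, seq FROM [{}] ORDER BY txId".format(table)
        return self.cur.execute(query).fetchall()

    def _write(self, table, rows):
        self.cur.execute("DROP TABLE IF EXISTS [{}]".format(table))
        self.cur.execute(
            "CREATE TABLE [{}](txId INT, taxon TEXT, seq TEXT)".format(table))
        self.cur.executemany(
            "INSERT INTO [{}] VALUES (?, ?, ?)".format(table), rows)

    @setup_database_sketch
    def to_upper(self, table_in=None, table_out="upper"):
        # Hypothetical modifier used only to build a two-step pipeline
        rows = [(i, t, s.upper()) for i, t, s in self._read(table_in)]
        self._write(table_out, rows)

    @setup_database_sketch
    def collapse(self, table_in=None, table_out="collapsed"):
        seen, rows = set(), []
        for _, taxon, seq in self._read(table_in):
            if seq not in seen:
                seen.add(seq)
                rows.append((len(rows), taxon, seq))
        self._write(table_out, rows)


rows = [(0, "spa", "aaa"), (1, "spb", "AAA"), (2, "spc", "AAT")]
aln = ToyAlignment(rows)

new_table = "master_table"
aln.to_upper(table_in=new_table, table_out=new_table)  # reads "main"
aln.collapse(table_in=new_table, table_out=new_table)  # reads "master_table"
pipeline_result = aln._read(new_table)
main_before = aln._read("main")  # original data is still intact here

aln.to_upper(use_main_table=True)  # overwrites the original table
main_after = aln._read("main")
```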

Writers¶

Like parsers, writer methods are defined with _write_<format>:

_write_fasta(): Writes to fasta format.

_write_phylip(): Writes to phylip format.

_write_nexus(): Writes to nexus format.

_write_stockholm(): Writes to stockholm format.

_write_gphocs(): Writes to gphocs format.

_write_ima2(): Writes to IMa2 format.

_write_mcmctree(): Writes to MCMCTree format.

They are always called from the Alignment.write_to_file() method, not directly. When the Alignment.write_to_file() method is called, a list with the requested output formats is also provided as an argument. For each format specified in the argument, the corresponding writer method is called. That method is responsible for fetching the data from the database and writing it to an output file in the appropriate format. The map between the formats and the methods is stored in a dictionary:

write_methods = {
    "fasta": self._write_fasta,
    "phylip": self._write_phylip,
    "nexus": self._write_nexus,
    "stockholm": self._write_stockholm,
    "gphocs": self._write_gphocs,
    "ima2": self._write_ima2,
    "mcmctree": self._write_mcmctree
}

# Call appropriate method for each output format
for fmt in output_format:
    write_methods[fmt](output_file, **kwargs)

The output_file and a kwargs dictionary are provided as arguments to each of these methods. The kwargs dictionary contains all keyword arguments used when calling write_to_file and each writer method fetches the ones relevant to the format. For instance, in the beginning of the _write_fasta() method, the relevant keyword arguments are retrieved:

# Get relevant keyword arguments
ld_hat = kwargs.get("ld_hat", False)
interleave = kwargs.get("interleave", False)
table_suffix = kwargs.get("table_suffix", None)
table_name = kwargs.get("table_name", None)
ns_pipe = kwargs.get("ns_pipe", None)
pbar = kwargs.get("pbar", None)

In addition to the write_methods dictionary, a dictionary mapping the formats and their corresponding file extensions is also defined:

format_ext = {"ima2": ".txt",
              "mcmctree": "_mcmctree.phy",
              "phylip": ".phy",
              "nexus": ".nex",
              "fasta": ".fas",
              "stockholm": ".stockholm",
              "gphocs": ".txt"}

To add a new output format writer, simply create the method using the _write_<format> notation and include it in the write_methods dictionary, along with the extension in the format_ext variable.
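As an illustration of these steps, the following sketch adds a made-up "tsv" writer to a toy class. ToyWriter, the "tsv" format, its ".tsv" extension and the separator keyword are hypothetical, and the assumption that each writer appends the extension from format_ext itself is made only for this example.

```python
import os
import tempfile


class ToyWriter:
    """Toy stand-in for Alignment; only the writer dispatch is illustrated."""

    format_ext = {"fasta": ".fas",
                  "tsv": ".tsv"}  # extension of the hypothetical new format

    def __init__(self, records):
        self.records = records  # stand-in for the sqlite table

    def _write_fasta(self, output_file, **kwargs):
        with open(output_file + self.format_ext["fasta"], "w") as fh:
            for taxon, seq in self.records:
                fh.write(">{}\n{}\n".format(taxon, seq))

    def _write_tsv(self, output_file, **kwargs):
        # Each writer fetches only the keyword arguments it cares about
        sep = kwargs.get("separator", "\t")
        with open(output_file + self.format_ext["tsv"], "w") as fh:
            for taxon, seq in self.records:
                fh.write("{}{}{}\n".format(taxon, sep, seq))

    def write_to_file(self, output_format, output_file, **kwargs):
        write_methods = {
            "fasta": self._write_fasta,
            "tsv": self._write_tsv,  # the new writer is registered here
        }
        for fmt in output_format:
            write_methods[fmt](output_file, **kwargs)


out_dir = tempfile.mkdtemp()
base = os.path.join(out_dir, "toy")
writer = ToyWriter([("spa", "AAA"), ("spb", "AAT")])
writer.write_to_file(["fasta", "tsv"], base, separator=",")

with open(base + ".fas") as fh:
    fasta_text = fh.read()
with open(base + ".tsv") as fh:
    tsv_text = fh.read()
```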

AlignmentList class¶

The AlignmentList is the main interface between the user and the alignment data. It may contain one or more Alignment objects, which are considered the building blocks of the data set. These Alignment objects are bound by the same sqlite database connection.

How to create an instance¶

An AlignmentList instance can be created with a single argument, which is a list of paths to alignment files:

aln_obj = AlignmentList(["file1.fas", "file2.fas"])

Even when no information is provided, an sqlite database connection is established automatically (generating a “trifusion.sqlite3” file in the current working directory by default). However, it is possible and advisable to specify the path to the sqlite database:

aln_obj = AlignmentList(["file1.fas", "file2.fas"], sql_db=".sql.db")

A connection to the database can also be provided when instantiating the class, so it’s perhaps more useful to see how the __init__() works:

def __init__(self, alignment_list, sql_db=None, db_cur=None, db_con=None,
             pbar=None):

    if db_cur and db_con:
        self.con = db_con
        self.cur = db_cur
    elif sql_db:
        self.sql_path = sql_db
    else:
        self.sql_path = "trifusion.sqlite3"

As we can see, if a database connection is provided via the db_cur and db_con arguments, the sql_path is ignored. If no sqlite information is provided, the sql_path attribute defaults to “trifusion.sqlite3”.

Adding alignment data¶

Alignment data can be loaded when initializing the class as above and/or added later via the add_alignments() and add_alignment_files() methods (in fact, the add_alignment_files() method is the one called at __init__()):

aln_obj.add_alignment_files(["file3.fas", "file4.fas"])

The most important aspects when adding alignment data are how each alignment is processed and how errors and exceptions are handled. Briefly, the flow is:

Check for file duplications within the loaded file list. Duplications are stored in the duplicate_alignments attribute.

Check for file duplications between the loaded files and any alignment already present. The duplications go to the same attribute as above.

For each provided file, create an Alignment object. Errors that occur when creating an Alignment object are stored in its e attribute. This attribute is then checked before adding the alignment to the AlignmentList object.

Check for InputError exception. These are malformatted files and are stored in the bad_alignments attribute.

Check for AlignmentUnequalLength exception. These are sequence sets of unequal length (not alignments) and are stored in the non_alignments attribute.

Check for EmptyAlignment exception. These are empty alignments and are stored in the bad_alignments attribute.

Check the sequence code of the Alignment object. For now, the AlignmentList object only accepts alignments of the same sequence type (e.g. DNA, protein).

If all of the previous checks pass, add the Alignment object to the alignments attribute.

Update the overall taxa names of the full data set after the inclusion of the new alignments.
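The flow above can be sketched with toy classes. This is a simplified illustration under several assumptions, not the real implementation: the exception classes are local stand-ins, "files" are passed as (name, records) pairs instead of paths, the sequence type check is a crude alphabet test, and the rejected_by_type attribute is hypothetical because the source does not say where alignments of a different sequence type are recorded.

```python
from collections import OrderedDict


class InputError(Exception):
    pass


class AlignmentUnequalLength(Exception):
    pass


class EmptyAlignment(Exception):
    pass


class ToyAlignment:
    def __init__(self, name, records):
        self.name, self.records = name, records
        self.sequence_code = None
        self.e = None  # errors are stored, not raised
        try:
            if records is None:
                raise InputError("Unrecognized format")
            if not records:
                raise EmptyAlignment("No sequence data")
            if len({len(seq) for _, seq in records}) > 1:
                raise AlignmentUnequalLength("Sequences differ in length")
            is_dna = all(set(seq.upper()) <= set("ACGTN-")
                         for _, seq in records)
            self.sequence_code = ("DNA", "n") if is_dna else ("Protein", "x")
        except (InputError, EmptyAlignment, AlignmentUnequalLength) as exc:
            self.e = exc


class ToyAlignmentList:
    def __init__(self):
        self.alignments = OrderedDict()
        self.duplicate_alignments = []
        self.bad_alignments = []
        self.non_alignments = []
        self.rejected_by_type = []  # hypothetical attribute
        self.sequence_code = None
        self.taxa_names = []

    def add_alignments(self, inputs):
        seen = set()
        for name, records in inputs:
            # Duplicates within the new batch or against loaded alignments
            if name in seen or name in self.alignments:
                self.duplicate_alignments.append(name)
                continue
            seen.add(name)
            aln = ToyAlignment(name, records)
            if isinstance(aln.e, (InputError, EmptyAlignment)):
                self.bad_alignments.append(name)
            elif isinstance(aln.e, AlignmentUnequalLength):
                self.non_alignments.append(name)
            elif self.sequence_code not in (None, aln.sequence_code):
                self.rejected_by_type.append(name)
            else:
                self.sequence_code = aln.sequence_code
                self.alignments[name] = aln
        # Update the overall taxa names of the data set
        for aln in self.alignments.values():
            for taxon, _ in aln.records:
                if taxon not in self.taxa_names:
                    self.taxa_names.append(taxon)


aln_list = ToyAlignmentList()
aln_list.add_alignments([
    ("a.fas", [("spa", "ACGT"), ("spb", "ACGA")]),
    ("a.fas", [("spa", "ACGT"), ("spb", "ACGA")]),  # duplicate
    ("b.fas", None),                                # malformatted
    ("c.fas", []),                                  # empty
    ("d.fas", [("spa", "AC"), ("spc", "A")]),       # unequal length
    ("e.fas", [("spa", "MKVL")]),                   # different sequence type
    ("f.fas", [("spc", "TTTT")]),
])
```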

Wrapping Alignment methods¶

The majority of the methods defined in the Alignment object can also be accessed in the AlignmentList object. These are defined roughly with the same arguments in both classes so that their behavior is the same. These can be simple wrappers that call the respective Alignment method for each alignment in the alignments attribute. For instance, the change_taxon_name() method is simply:

def change_taxon_name(self, old_name, new_name):

    for alignment_obj in list(self.alignments.values()):
        alignment_obj.change_taxon_name(old_name, new_name)

    self.taxa_names = [new_name if x == old_name else x
                       for x in self.taxa_names]

To avoid duplicating the argument list, the wrapping method can use *args and **kwargs to transfer arguments. This ensures that if the argument list is modified in the Alignment method, the wrapper method does not need any modification. For instance, the write_to_file() method of Alignment accepts a large number of positional and keyword arguments, which would be a hassle to define and maintain in the wrapper method of AlignmentList. So, the write_to_file() method of AlignmentList is simply defined as:

def write_to_file(self, output_format, conversion_suffix="",
                  output_suffix="", *args, **kwargs):

The output_format, conversion_suffix and output_suffix are the only positional arguments required when calling this wrapper. All the remaining arguments are packed in the args and kwargs objects and used normally by the wrapped method.
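The forwarding idea can be shown with two toy classes. The sketch below is hypothetical: the argument list of the toy Alignment.write_to_file() is shortened, and the way the output file name is built from the alignment name and the two suffixes is an assumption made for the example.

```python
from collections import OrderedDict


class ToyAlignment:
    def __init__(self):
        self.calls = []  # records how the method was called

    def write_to_file(self, output_format, output_file, interleave=False,
                      ld_hat=False):
        self.calls.append(
            (tuple(output_format), output_file, interleave, ld_hat))


class ToyAlignmentList:
    def __init__(self, names):
        self.alignments = OrderedDict((n, ToyAlignment()) for n in names)

    def write_to_file(self, output_format, conversion_suffix="",
                      output_suffix="", *args, **kwargs):
        for name, alignment_obj in self.alignments.items():
            # Assumption: output name is the input basename plus suffixes
            base = name.rsplit(".", 1)[0]
            output_file = base + conversion_suffix + output_suffix
            # Everything else is forwarded untouched
            alignment_obj.write_to_file(output_format, output_file,
                                        *args, **kwargs)


aln_list = ToyAlignmentList(["file1.fas", "file2.fas"])
aln_list.write_to_file(["nexus"], "_conv", interleave=True)
first_call = aln_list.alignments["file1.fas"].calls[0]
```

If a new keyword argument is later added to the Alignment method, callers can pass it through the wrapper without the wrapper changing.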

Exclusive AlignmentList methods¶

Some methods are exclusive to AlignmentList because they only make sense when applied to lists of alignments (e.g. concatenate()). These have more freedom in how they are defined and called.

Active and inactive datasets¶

Taxa and/or alignments can become ‘inactive’, that is, they are temporarily removed from their respective attributes, taxa_names and alignments. This means that these ‘inactive’ elements are ignored when performing most operations. To change the ‘active’ status of alignments, the update_active_alignments() and update_active_alignment() methods are available. For taxa, the update_taxa_names() method can be used:

# Set only two active alignments
self.update_active_alignments(["file1.fas", "file2.fas"])

# Set only two active taxa
self.update_taxa_names(["taxonA", "taxonB"])

Note that all these modifications are reversible. ‘Inactive’ elements are stored in the shelve_alignments attribute for alignments, and shelved_taxa for taxa.
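A rough sketch of how such shelving could work is given below. It is an assumption about the mechanism, not TriFusion's code: only the attribute names alignments and shelve_alignments and the method name update_active_alignments() come from the text above, while the bookkeeping inside the method is invented for the example.

```python
from collections import OrderedDict


class ToyAlignmentList:
    def __init__(self, names):
        self.alignments = OrderedDict((n, object()) for n in names)
        self.shelve_alignments = OrderedDict()

    def update_active_alignments(self, active_names):
        # Pool active and shelved alignments so that previously shelved
        # ones can be reactivated
        everything = OrderedDict(self.alignments)
        everything.update(self.shelve_alignments)
        self.alignments = OrderedDict(
            (n, a) for n, a in everything.items() if n in active_names)
        self.shelve_alignments = OrderedDict(
            (n, a) for n, a in everything.items() if n not in active_names)


aln_list = ToyAlignmentList(["file1.fas", "file2.fas", "file3.fas"])

aln_list.update_active_alignments(["file1.fas", "file2.fas"])
active_first = list(aln_list.alignments)
shelved_first = list(aln_list.shelve_alignments)

# The change is reversible: a shelved alignment can become active again
aln_list.update_active_alignments(["file3.fas"])
active_second = list(aln_list.alignments)
shelved_second = list(aln_list.shelve_alignments)
```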

Updating Alignment objects¶

Some tasks perform changes to core attributes of AlignmentList, but they may also be necessary on each Alignment object. For instance, the remove_taxa() method is used to remove a list of taxa from the AlignmentList object. It is easy to change only the relevant AlignmentList attributes, but this change also requires those particular taxa to be removed in all Alignment objects. For this reason, such methods should be defined in the same way in both classes. Using the remove_taxa() example:

def remove_taxa(self, taxa_list, mode="remove"):

    # <changes to AlignmentList>

    for alignment_obj in list(self.alignments.values()):
        alignment_obj.remove_taxa(taxa_list, mode=mode)

As you can see, the usage is the same for both methods.

Plot data methods¶

The AlignmentList object contains all methods that generate data for plotting in TriFusion’s Statistics screen and TriStats CLI program. However, it’s important to establish a separation between the generation of plot data and the generation of the plot itself. The AlignmentList methods only generate the data and relevant instructions necessary to draw the plot. This information is then passed on to the appropriate plot generation functions, which are defined in trifusion.base.plotter. The reason for this separation of tasks is that many different alignment analyses are represented by the same plot.

The complete process of how new plots can be added to TriFusion is described here. In this section, we provide only a few guidelines on what to expect from these methods.

All plot data methods must be decorated with the check_data() decorator and take at least a Namespace argument. In most cases, no more arguments are required:

@check_data
def gene_occupancy(self, ns=None):

The check_data() decorator is responsible for performing checks before and after executing the method. The Namespace argument, ns, is used to allow communication between the main and worker threads of TriFusion.

Additional keyword arguments may be defined, but in that case they must be provided in TriFusion when the trifusion.app.TriFusionApp.stats_show_plot() method is called, using the additional_args argument. This object will be passed to the get_stats_data() function in trifusion.data.resources.background_tasks and used when calling the plot data methods:

if additional_args:
    plot_data = methods[stats_idx](ns=ns, **additional_args)
else:
    plot_data = methods[stats_idx](ns)

The other requirement of plot data methods is that they must always return a single dictionary object. This dictionary must contain at least one key:value pair, with the “data” key and a numpy.array with the plot data as the value. The other entries in the dictionary are optional and refer to instructions for plot generation. For example, the missing_data_distribution method returns:

return {"data": data,
        "title": "Distribution of missing data",
        "legend": legend,
        "ax_names": ["Proportion", "Number of genes"],
        "table_header": ["Bin"] + legend}

The first entry is the mandatory “data” key with the numpy.array data. The other instructions are “title”, which sets the title of the plot, “legend”, which provides the labels for the plot legend, “ax_names”, which provides the names of the x and y axes, and “table_header”, which specifies the header for the table of that plot.

The allowed plot instructions depend on the plot function that will be used and not all of them need to be specified.

Logging progress¶

The majority of AlignmentList methods support the setup and update of progress indicators that can be used in TriFusion (GUI) and the CLI programs. In the case of TriFusion, the progress indicator is provided via a multiprocessing.Namespace object that transfers information between the main thread (where the GUI changes take place) and the working thread. In the case of the CLI programs, the indicator is provided via a ProgressBar object. In either case, the setup, update and resetting of the progress indicators is performed by the same methods.

At the beginning of an operation, the AlignmentList._set_pipes() method is called:

self._set_pipes(ns, pbar, total=len(self.alignments))

The Namespace object is defined as ns and the ProgressBar is defined as pbar. Usually, only one of them is provided, depending on whether it was called from TriFusion or from a CLI program. We also set the total of the progress indicator. In this case it’s the number of alignments. In case the operation is called from TriFusion using the Namespace object, this method also checks the number of active alignments. If there is only one active alignment, it sets a Namespace attribute that will silence the progress logging of the AlignmentList object and receive the information from the Alignment object instead:

if ns:
    if len(self.alignments) == 1:
        ns.sa = True
    else:
        ns.sa = False

Then, inside the task, we can update the progress within a loop using the _update_pipes() method:

for p, alignment_obj in enumerate(list(self.alignments.values())):
    self._update_pipes(ns, pbar, value=p + 1,
                       msg="Some message")

Here, we provide the Namespace and ProgressBar objects as before. In addition we provide the value associated with each iteration. Optionally, the msg argument can be specified, which is used exclusively by TriFusion to show a custom message.

At the end of a task it’s good practice to reset the progress indicators by using the _reset_pipes() method:

self._reset_pipes(ns)

Here, only the Namespace object is necessary, since the ProgressBar indicator is automatically reset on _set_pipes().
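The three calls fit together as in the sketch below. It is a toy model and not the real implementation: types.SimpleNamespace stands in for the Namespace object, ToyProgressBar stands in for the ProgressBar object, and the ns.total, ns.counter and ns.msg attribute names are assumptions (only ns.sa appears in the text above).

```python
from types import SimpleNamespace


class ToyProgressBar:
    """Minimal stand-in for a ProgressBar; it just records updates."""

    def __init__(self):
        self.max_value = None
        self.values = []

    def update(self, value):
        self.values.append(value)


class ToyAlignmentList:
    def __init__(self, names):
        self.alignments = {name: None for name in names}

    def _set_pipes(self, ns=None, pbar=None, total=None):
        if ns:
            ns.total, ns.counter, ns.msg = total, 0, None
            # Single alignment: progress comes from the Alignment object
            ns.sa = len(self.alignments) == 1
        if pbar:
            pbar.max_value = total
            pbar.update(0)  # the bar is reset here, not in _reset_pipes

    def _update_pipes(self, ns=None, pbar=None, value=None, msg=None):
        if ns:
            ns.counter, ns.msg = value, msg
        if pbar:
            pbar.update(value)

    def _reset_pipes(self, ns):
        if ns:
            ns.total = ns.counter = ns.msg = ns.sa = None

    def some_task(self, ns=None, pbar=None):
        self._set_pipes(ns, pbar, total=len(self.alignments))
        for p, name in enumerate(list(self.alignments)):
            self._update_pipes(ns, pbar, value=p + 1,
                               msg="Processing " + name)
        self._reset_pipes(ns)


# CLI-style call: only the progress bar is provided
pbar = ToyProgressBar()
ToyAlignmentList(["file1.fas", "file2.fas"]).some_task(pbar=pbar)

# GUI-style call: only the namespace is provided
ns = SimpleNamespace()
ToyAlignmentList(["file1.fas"])._set_pipes(ns, total=1)
single_alignment_flag = ns.sa
```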

Incorporating a kill switch¶

All time consuming methods of AlignmentList accept a Namespace object when called from TriFusion, allowing communication between the main thread and the worker thread. Since python threads cannot be forcibly terminated like processes, most methods should listen to a kill switch flag that is actually an attribute of the Namespace object. This kill switch is already incorporated into the _set_pipes(), _update_pipes() and _reset_pipes() methods.

if ns:
    if ns.stop:
        raise KillByUser("")

What it does is to listen to a kill signal from TriFusion’s window (it can be clicking on the “Cancel” button, for example). When this kill signal is received in the main thread, it sets the Namespace.stop attribute to True in both main and worker threads. In the worker thread, when the stop attribute evaluates to True, a custom KillByUser exception is raised, immediately stopping the task. The exception is then handled in the trifusion.data.resources.background_tasks.process_execution() function and transmitted to the main thread.
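The mechanism can be tried out with a plain thread. This is a stand-alone sketch under stated assumptions: KillByUser is redefined locally, types.SimpleNamespace replaces the Namespace object used by TriFusion, and the worker loop with its sleep interval is invented for the example.

```python
import threading
import time
from types import SimpleNamespace


class KillByUser(Exception):
    """Local stand-in for TriFusion's KillByUser exception."""


def check_kill_switch(ns):
    # Same check as the one embedded in the *_pipes methods
    if ns:
        if ns.stop:
            raise KillByUser("")


def worker(ns, started, outcome):
    try:
        started.set()
        while True:  # stands in for a long-running task
            check_kill_switch(ns)
            time.sleep(0.01)
    except KillByUser:
        # In TriFusion this is handled by process_execution()
        outcome["status"] = "killed by user"


ns = SimpleNamespace(stop=False)
started = threading.Event()
outcome = {}

thread = threading.Thread(target=worker, args=(ns, started, outcome))
thread.start()
started.wait()

ns.stop = True  # what clicking "Cancel" would trigger in the main thread
thread.join(timeout=5)
```

The task only stops at the points where the flag is checked, so long loops should call one of the pipe methods on every iteration.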

Using SQLite¶

A great deal of the high performance and memory efficiency of the sequence module comes from the use of sqlite to store and manipulate alignment data on disk rather than in RAM. This means that parsing, modifying and writing alignment data must be done with great care to ensure that only the essential data is loaded into memory, while minimizing the number of (expensive) database queries. This has some implications for the methods of both Alignment and AlignmentList objects with respect to how parsing, alignment modification and output writing are performed.

Implications for parsing¶

When writing or modifying parsing methods, it is important to take into account that alignment files can be very large (> 1 GB), so loading the entire data set into memory should be avoided. Whenever possible, only the data of a single taxon should be kept in memory before inserting it into the database and then releasing that memory. For most formats, particularly non-interleaved (“leave”) formats, it’s fairly straightforward to do this. However, interleave formats can fragment the data from each taxon across the entire file. Since database insertions and updates are expensive, inserting the data from each line separately can greatly decrease performance in these formats. Therefore, it’s preferable to read the alignment file once per taxon, join the entire sequence of that taxon, and then insert it into the database. This ensures that only sequence data from one taxon is kept in memory at any given time and that only a minimal number of database insertions is performed. It also results in the same file being parsed N times, where N is the number of taxa. However, the decrease in speed is marginal, since most lines are actually skipped, whereas the gain in memory efficiency is substantial.
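The once-per-taxon strategy can be sketched as follows. The example assumes a deliberately simplified interleaved layout (every line is "<taxon> <chunk>" and blocks are separated by blank lines), which is not one of the real formats, and parse_interleaved() is a hypothetical helper rather than a TriFusion method.

```python
import os
import sqlite3
import tempfile

# Simplified interleaved layout used only for this example
INTERLEAVED = "spa AAAA\nspb CCCC\n\nspa TTTT\nspb GGGG\n"


def parse_interleaved(path, cur, table):
    # First pass: collect taxon names from the first block only
    taxa = []
    with open(path) as fh:
        for line in fh:
            if not line.strip():
                if taxa:
                    break
                continue
            taxa.append(line.split()[0])

    # One pass per taxon: only that taxon's chunks are kept in memory
    for tx_id, taxon in enumerate(taxa):
        chunks = []
        with open(path) as fh:
            for line in fh:
                fields = line.split()
                if fields and fields[0] == taxon:
                    chunks.append("".join(fields[1:]))
        # A single insertion per taxon
        cur.execute("INSERT INTO [{}] VALUES (?, ?, ?)".format(table),
                    (tx_id, taxon, "".join(chunks)))


path = os.path.join(tempfile.mkdtemp(), "toy_interleaved.txt")
with open(path, "w") as fh:
    fh.write(INTERLEAVED)

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE [main](txId INT, taxon TEXT, seq TEXT)")
parse_interleaved(path, cur, "main")
parsed = cur.execute("SELECT * FROM [main] ORDER BY txId").fetchall()
```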

Implications for fetching data¶

Retrieving data from an sqlite database is not as simple as accessing a native Python data structure. Therefore, a set of methods and generators has been defined in the Alignment object to facilitate the interface with the data in the database (see Data fetching). When some kind of data is required from the database, it is preferable to modify or create a dedicated method instead of interacting with the database directly. This creates a layer of abstraction between the database and the Alignment/AlignmentList methods that greatly facilitates future updates.

Implications for alignment modification¶

When performing modification to the alignment data it is important to take into account that the original alignment data may need to be preserved for future operations. These methods must be defined and called using appropriate decorators and arguments that establish the name of the database table from where information will be retrieved, and the name of the table where the information will be written (See Alignment modifiers).

Implications for writing files¶

When writing alignment data into new output files, the same caution as in alignment parsing is advised. It’s quite easy to let the entire alignment be loaded into RAM, particularly when writing in interleave format.
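One possible way to keep memory usage low when writing interleaved output is to let sqlite slice the sequences with its substr() function, so that only one block of columns is fetched at a time. The sketch below illustrates that idea only; it is not necessarily how TriFusion's writers work, write_interleaved() is a hypothetical helper, and the output layout is simplified.

```python
import io
import sqlite3


def write_interleaved(con, table, fh, width=4):
    """Write blocks of `width` columns, fetching one block per query."""
    (length,) = con.execute(
        "SELECT max(length(seq)) FROM [{}]".format(table)).fetchone()
    query = "SELECT taxon, substr(seq, ?, ?) FROM [{}] ORDER BY txId".format(
        table)
    # substr() positions are 1-based in sqlite
    for start in range(1, length + 1, width):
        for taxon, chunk in con.execute(query, (start, width)):
            fh.write("{} {}\n".format(taxon, chunk))
        fh.write("\n")


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE [main](txId INT, taxon TEXT, seq TEXT)")
con.executemany("INSERT INTO [main] VALUES (?, ?, ?)",
                [(0, "spa", "AAAATTTT"), (1, "spb", "CCCCGGGG")])

buffer = io.StringIO()  # stands in for the output file handle
write_interleaved(con, "main", buffer, width=4)
interleaved_text = buffer.getvalue()
```

This trades memory for queries: one query is issued per block, so the block width should be large enough to keep the number of queries reasonable.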

class trifusion.process.sequence.Alignment(input_alignment, input_format=None, partitions=None, locus_length=None, sequence_code=None, taxa_list=None, taxa_idx=None, sql_cursor=None, sql_con=None)[source]¶

Bases: trifusion.process.base.Base

Main interface for single alignment files.

The Alignment class is the main interface for single alignment files, providing methods that parse, query, retrieve and modify alignment data. It is usually created within the context of AlignmentList, which handles the connection with the sqlite database. Nevertheless, Alignment instances can be created by providing the input_alignment argument, which can be one of the following:

path to alignment file. In this case, the file is parsed and information stored in a new sqlite table.
sqlite table name. In this case, the table name must be present in the database.

In either case, the sqlite Connection and Cursor objects should be provided.

When an Alignment object is instantiated, it first generates the table_name based on the input_alignment string, filtering out all characters that are not alphanumeric. Then, it queries the database to check if a table already exists with that name. If it does, it is assumed that the alignment data is stored in the provided table name. If there is no table with that name, it is assumed that input_alignment is a path to the alignment file and the regular parsing ensues. An empty table is created, the sequence type, format and missing data symbol are automatically detected and the alignment is parsed according to the detected format.
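The decision between "existing table" and "file to parse" can be sketched with sqlite3 alone. The helper names make_table_name() and resolve_input() are hypothetical, and the use of str.isalnum() for the filtering step is an assumption about how non-alphanumeric characters are removed.

```python
import sqlite3


def make_table_name(input_alignment):
    # Keep only alphanumeric characters of the input string
    return "".join(ch for ch in input_alignment if ch.isalnum())


def resolve_input(cur, input_alignment):
    """Return (table_name, kind), where kind is 'table' or 'file'."""
    table_name = make_table_name(input_alignment)
    found = cur.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name=?",
        (table_name,)).fetchone()
    # Existing table: data is already stored. Otherwise: treat as a path.
    return table_name, ("table" if found else "file")


con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE [main2](txId INT, taxon TEXT, seq TEXT)")

existing = resolve_input(cur, "main2")
new = resolve_input(cur, "data/file1.fas")
```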

Parameters:

input_alignment : str

Can be either the path to an alignment file or the name of a table in the sqlite database.

sql_cursor : sqlite3.Cursor

Cursor object of the sqlite database.

sql_con : sqlite3.Connection

Connection object of the sqlite database.

input_format : str, optional

File format of input_alignment. If input_alignment is a file path, the format will be automatically detected from the file. The value provided with this argument overrides the result of the automatic detection.

partitions : trifusion.process.data.Partitions, optional

If provided, it will set the partitions attribute. This should be used only when input_alignment is a database table name.

locus_length : int, optional

Sets the length of the current alignment, stored in the locus_length attribute. This option should only be used when input_alignment is a database table name. Otherwise, it is automatically set during alignment parsing.

sequence_code : tuple, optional

Sets the sequence_code attribute with the information on (<sequence_type>, <missing data symbol>). This option should only be used when input_alignment is a database table name. Otherwise, it is automatically set during alignment parsing.

taxa_list : list, optional

Sets the list attribute taxa_list with the names of the taxa present in the alignment. This option should only be used when input_alignment is a database table name. Otherwise, it is automatically set during alignment parsing.

taxa_idx : dict, optional

Sets the dictionary attribute taxa_idx that maps the taxon names to their index in the sqlite database table. This option should only be used when input_alignment is a database table name. Otherwise, it is automatically set during alignment parsing.

Notes

The Alignment class is not meant to be used directly (although it is quite possible to do so if you handle the database connections before). Instead, use the AlignmentList class, even if there is only one alignment. All Alignment methods are available from the AlignmentList and it is also possible to retrieve specific Alignment objects.

The Alignment class was designed to be a lightweight, fast and powerful interface between alignment data and a set of manipulation and transformation methods. For performance and efficiency purposes, all alignment data is stored in a sqlite database that prevents the entire alignment from being loaded into memory. To facilitate the retrieval and iteration over the alignment data, several methods (iter_columns, iter_sequences, etc.) are available to handle the interface with the database and retrieve only the necessary information.

Attributes

cur	(sqlite3.Cursor) Cursor object of the sqlite database.
con	(sqlite3.Connection) Connection object of the sqlite database.
table_name	(str) Name of the sqlite database’s table storing the sequence data.
tables	(list) Lists the sqlite database’s tables created for the Alignment instance. The “master” table is always present and represents the original alignment.
partitions	(trifusion.process.data.Partitions) Stores the partitions and substitution model definition for the alignment.
locus_length	(int) Length of the alignment in base pairs or residues.
restriction_range	(str) Only used when gaps are coded in a binary matrix. Stores a string with the range of the restriction-type data that will encode gaps and will only be used when nexus is in the output format.
e	(None or Exception) Stores any exceptions that occur during the parsing of the alignment file. It remains None unless something wrong happens.
taxa_list	(list) List with the active taxon names.
taxa_idx	(dict) Maps the taxon names to their corresponding index in the sqlite database. The index is not retrieved from the position of the taxon in taxa_list to prevent messing up when taxa are removed from the Alignment object.
shelved_taxa	(list) List of ignored taxon names.
path	(str) Full path to alignment file.
sname	(str) Basename of the alignment file without the extension.
name	(str) Basename of the alignment file with the extension.
sequence_code	(tuple) Contains information on (<sequence type>, <missing data symbol>), e.g. (“Protein”, “x”).
interleave_data	(bool) Attribute that is set to True when interleave data has been created for the alignment data.
input_format	(str) Format of the input alignment file.

Methods

`autofinder`(reference_file)	Autodetects format, missing data symbol and sequence type.
`change_taxon_name`(old_name, new_name)	Changes the name of a particular taxon.
`code_gaps`(*args, **kwargs)	Code gap patterns as binary matrix at the end of alignment.
`collapse`(*args, **kwargs)	Collapses equal sequences into unique haplotypes.
`consensus`(*args, **kwargs)	Creates a consensus sequence from the alignment.
`duplicate_taxa`(taxa_list)	Identifies duplicate items in a list.
`filter_codon_positions`(*args, **kwargs)	Filter alignment according to codon positions.
`filter_informative_sites`(min_val, max_val[, ...])	Checks if the variable sites from alignment are within range.
`filter_missing_data`(*args, **kwargs)	Filters alignment according to missing data.
`filter_segregating_sites`(min_val, max_val[, ...])	Checks if the segregating sites from alignment are within range.
`get_loci_taxa`(loci_file)	Get the list of taxa from a .loci file.
`get_sequence`(*args, **kwargs)	Returns the sequence string for a given taxon.
`guess_code`(sequence)	Guess the sequence type, i.e.
`iter_alignment`(*args, **kwargs)	Generator for (taxon, sequence) tuples.
`iter_columns`(*args, **kwargs)	Generator over alignment columns.
`iter_columns_uniq`(*args, **kwargs)	Generator over unique characters of an alignment column.
`iter_sequences`(*args, **kwargs)	Generator over sequence strings in the alignment.
`read_alignment`([size_check])	Main alignment parser method.
`read_basic_csv`(file_handle)	Reads a basic CSV into a list.
`remove_alignment`()	Removes alignment data from the database.
`remove_taxa`(taxa_list_file[, mode])	Removes taxa from the Alignment object.
`reverse_concatenate`([table_in, db_con, ns, pbar])	Reverses the alignment according to the partitions attribute.
`rm_illegal`(taxon_string)	Removes illegal characters from taxon name.
`set_partitions`(partitions)	Updates the partition attribute.
`shelve_taxa`(lst)	Shelves taxa from Alignment methods.
`write_loci_correspondence`(hap_dict, output_file)	Writes the file mapping taxa to unique haplotypes for collapse.
`write_to_file`(output_format, output_file, ...)	Writes alignment data into an output file.

_check_partitions(partition_obj)[source]¶

Consistency check of the partition_obj for Alignment object.

Performs checks to ensure that a partition_obj is consistent with the alignment data of the Alignment object.

Parameters:

partition_obj : trifusion.process.data.Partitions

Partitions object.

See also

trifusion.process.data.Partitions

_create_table(table_name, cur=None)[source]¶

Creates a new table in the database.

Convenience method that creates a new table in the sqlite database. It accepts a custom Cursor object, which overrides the cur attribute.

Parameters:

table_name : str

Name of the table to be created.

cur : sqlite3.Cursor, optional

Custom Cursor object used to query the database.

_eval_missing_symbol(sequence)[source]¶

Evaluates missing data symbol from sequence string.

This method is performed when sequence data is being parsed from the alignment and only executes when the missing data symbol stored in Alignment.sequence_code is not defined. It attempts to count the regular characters used to denote missing data. First, it tries to find “?” characters, which are set as the symbol if they exist. Then, it looks for “n” characters and, if the sequence type is set to DNA, this character is set as the symbol. Finally, it looks for “x” characters and, if the sequence type is set to Protein, this character is set as the symbol.

Parameters:

sequence : str

Sequence string
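The order of checks described for _eval_missing_symbol can be sketched as follows. This is a minimal illustration, not the method's source: the function name, the explicit seq_type argument and the lower-casing of the sequence are assumptions.

```python
def eval_missing_symbol_sketch(sequence, seq_type):
    """Return a guessed missing data symbol, or None if none is found.

    seq_type is assumed to be "DNA" or "Protein".
    """
    seq = sequence.lower()  # assumption: the check is case-insensitive
    if "?" in seq:
        return "?"
    if seq_type == "DNA" and "n" in seq:
        return "n"
    if seq_type == "Protein" and "x" in seq:
        return "x"
    return None
```

Note that “n” is only accepted for DNA, since in protein alignments it denotes asparagine rather than missing data.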

_filter_columns(gap_threshold, missing_threshold, table_in, table_out, ns=None)[source]¶

Filters alignment columns based on missing data content.

Filters alignment columns based on the amount of missing data and gap content. If the calculated content is higher than a defined threshold, the column is removed from the final alignment. Setting gap_threshold and missing_threshold to 0 results in an alignment with no missing data, while setting them to 100 will not remove any column.

This method should not be called directly. Instead it is used by the filter_missing_data method.

Parameters:

gap_threshold : int

Integer between 0 and 100 defining the percentage above which a column with that gap percentage is removed.

missing_threshold : int

Integer between 0 and 100 defining the percentage above which a column with that gap+missing percentage is removed.

table_in : string

Name of database table containing the alignment data that is used for this operation.

table_out : string

Name of database table where the final alignment will be inserted (default is “collapsed”).

ns : multiprocessing.Manager.Namespace, optional

A Namespace object used to communicate with the main thread in TriFusion.

See also

filter_missing_data
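To make the threshold semantics of _filter_columns concrete, here is a small in-memory sketch. It assumes “-” as the gap symbol and “n” as the missing data symbol, and it works on a plain list of strings rather than on database tables, so it only illustrates the rule, not the actual implementation.

```python
def filter_columns_sketch(seqs, gap_threshold, missing_threshold,
                          gap="-", missing="n"):
    """Drop columns whose gap or gap+missing percentage exceeds a threshold."""
    ntaxa = len(seqs)
    keep = []
    for idx, col in enumerate(zip(*seqs)):
        n_gap = col.count(gap)
        n_missing = col.count(missing)
        gap_pct = 100.0 * n_gap / ntaxa
        missing_pct = 100.0 * (n_gap + n_missing) / ntaxa
        # A column is kept only when both percentages are within bounds
        if gap_pct <= gap_threshold and missing_pct <= missing_threshold:
            keep.append(idx)
    return ["".join(seq[i] for i in keep) for seq in seqs]


example = ["ac-n", "a--t", "acgt", "a-gt"]
strict = filter_columns_sketch(example, 0, 0)
relaxed = filter_columns_sketch(example, 100, 100)
partial = filter_columns_sketch(example, 25, 25)
```

With both thresholds at 0 only the fully populated first column survives, and with both at 100 the alignment is returned unchanged.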

_filter_terminals(table_in, table_out, ns=None)[source]¶

Replace gaps at alignment’s extremities with missing data.

This method should not be called directly. Instead it is used by the filter_missing_data method.

Parameters:

table_in : string

Name of database table containing the alignment data that is used for this operation.

table_out : string

Name of database table where the final alignment will be inserted (default is “collapsed”).

ns : multiprocessing.Manager.Namespace, optional

A Namespace object used to communicate with the main thread in TriFusion.

See also

filter_missing_data
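The effect of _filter_terminals on a single sequence can be sketched like this, assuming “-” as the gap symbol and “n” as the missing data symbol. The function name is hypothetical and the real method operates on database tables.

```python
def filter_terminals_sketch(seq, gap="-", missing="n"):
    """Replace leading and trailing gaps with the missing data symbol."""
    no_left = seq.lstrip(gap)
    n_left = len(seq) - len(no_left)      # gaps at the start
    core = no_left.rstrip(gap)
    n_right = len(no_left) - len(core)    # gaps at the end
    return missing * n_left + core + missing * n_right
```

Internal gaps are left untouched, since they are more likely to represent true indels than unsequenced regions.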

_get_interleave_data(table_name=None, table_suffix=None, ns_pipe=None, pbar=None)[source]¶

Builds an interleave matrix on the database.

When the user requests an interleave output format, this method rearranges the data matrix (which is stored in leave format in the database) into interleave format in a new table. This should be performed only once, even when multiple formats are specified in interleave. The first one calls this method, and the subsequent ones use the resulting table.

Parameters:

ns_pipe : multiprocessing.Manager.Namespace, optional

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar, optional

A ProgressBar object used to log the progress of TriSeq execution.

table_suffix : string, optional

Suffix of the table from where the sequence data is fetched. The suffix is appended to Alignment.table_name.

table_name : string, optional

Name of the table from where the sequence data is fetched.

_read_fasta()[source]¶

Alignment parser for fasta format.

Parses a fasta alignment file and stores taxa and sequence data in the database.

Returns:

size_list : list

List of sequence size (int) for each taxon in the alignment file. Used to check size consistency at the end of the alignment parsing in read_alignment.

See also

read_alignment

_read_interleave_nexus(ntaxa)[source]¶

Alignment parser for interleave nexus format.

Parses an interleave nexus alignment file and stores taxa and sequence data in the database. This method is only called from the _read_nexus method, when the regular nexus parser detects that the file is in interleave format.

Parameters:

ntaxa : int

Number of taxa contained in the alignment file.

Returns:

size_list : list

List of sequence size (int) for each taxon in the alignment file. Used to check size consistency at the end of the alignment parsing in read_alignment.

See also

_read_nexus

Notes

The nexus interleave format splits the alignment into blocks of a certain length (usually 90 characters) separated by blank lines. This means that to gather the complete sequence of a given taxon, the parser has to read the entire file. To prevent the entire alignment from being loaded into memory, we iterate over a range determined by the number of taxa in the alignment. In each iteration, we open a file handle and retrieve only the sequence of a particular taxon. This ensures that only the sequence data for a single taxon is stored in memory at any given time. This also means that the alignment file has to be read N times, where N = number of taxa. However, since the vast majority of lines are skipped in each iteration, the decrease in speed is marginal, while the gains in memory efficiency are much larger.

_read_interleave_phylip(ntaxa)[source]¶

Alignment parser for interleave phylip format.

Parses an interleave phylip alignment file and stores taxa and sequence data in the database. This method is only called from the _read_phylip method, when the regular phylip parser detects that the file is in interleave format.

Parameters:

ntaxa : int

Number of taxa contained in the alignment file.

Returns:

size_list : list

List of sequence size (int) for each taxon in the alignment file. Used to check size consistency at the end of the alignment parsing in read_alignment.

See also

_read_phylip

Notes

The phylip interleave format splits the alignment into blocks of a certain length (usually 90 characters) separated by blank lines. This means that to gather the complete sequence of a given taxon, the parser has to read the entire file. To prevent the entire alignment from being loaded into memory, we iterate over a range determined by the number of taxa in the alignment. In each iteration, we open a file handle and retrieve only the sequence of a particular taxon. This ensures that only the sequence data for a single taxon is stored in memory at any given time. This also means that the alignment file has to be read N times, where N = number of taxa. However, since the vast majority of lines are skipped in each iteration, the decrease in speed is marginal, while the gains in memory efficiency are much larger.
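The one-pass-per-taxon strategy described in these notes can be sketched as below. The parser is deliberately simplified and is an assumption for illustration: it expects a header line, whitespace-separated taxon names in the first block only, and blank lines between blocks. The function name read_interleave_sketch is hypothetical.

```python
import os
import tempfile


def read_interleave_sketch(path, ntaxa):
    """Yield (taxon, sequence) pairs, re-reading the file once per taxon."""
    for idx in range(ntaxa):
        taxon, chunks = None, []
        with open(path) as fh:
            next(fh)  # skip the "ntaxa nchar" header line
            row = 0   # position of the current line inside its block
            for line in fh:
                if not line.strip():
                    row = 0  # a blank line starts a new block
                    continue
                if row == idx:
                    fields = line.split()
                    if taxon is None:
                        # Only the first block carries the taxon name
                        taxon, fields = fields[0], fields[1:]
                    chunks.append("".join(fields))
                row += 1
        # Only the chunks of a single taxon are held in memory here
        yield taxon, "".join(chunks)


demo = "2 8\nspa ACGT\nspb AC-T\n\nTTGA\nTT-A\n"
with tempfile.NamedTemporaryFile("w", suffix=".phy", delete=False) as tmp:
    tmp.write(demo)
records = list(read_interleave_sketch(tmp.name, 2))
os.remove(tmp.name)
```

Every pass skips all lines that do not belong to the current taxon, which is why the repeated reads cost little compared with holding the full matrix in memory.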

_read_loci()[source]¶

Alignment parser for pyRAD and ipyrad loci format.

Returns:

size_list : list

List of sequence size (int) for each taxon in the alignment file. Used to check size consistency at the end of the alignment parsing in read_alignment.

See also

read_alignment

_read_nexus()[source]¶

Alignment parser for nexus format.

Parses a nexus alignment file and stores taxa and sequence data in the database.

Returns:

size_list : list

List of sequence size (int) for each taxon in the alignment file. Used to check size consistency at the end of the alignment parsing in read_alignment.

See also

read_alignment

_read_phylip()[source]¶

Alignment parser for phylip format.

Parses a phylip alignment file and stores taxa and sequence data in the database.

Returns:

size_list : list

List of sequence size (int) for each taxon in the alignment file. Used to check size consistency at the end of the alignment parsing in read_alignment.

See also

read_alignment

_read_stockholm()[source]¶

Alignment parser for stockholm format.

Parses a stockholm alignment file and stores taxa and sequence data in the database.

Returns:

size_list : list

List of sequence size (int) for each taxon in the alignment file. Used to check size consistency at the end of the alignment parsing in read_alignment.

See also

read_alignment

static _reset_pipes(ns)[source]¶

Reset progress indicators for both GUI and CLI task executions.

This should be done at the end of any task that initialized the progress objects, but it only affects the Namespace object. It resets all Namespace attributes to None.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

See also

_set_pipes, _update_pipes

_set_format(input_format)[source]¶

Manually set the input format of the Alignment object.

Manually sets the input format associated with the Alignment object.

Parameters:

input_format : str

The input format.

static _set_pipes(ns=None, pbar=None, total=None, msg=None)[source]¶

Setup of progress indicators for both GUI and CLI task executions.

This handles the setup of the objects responsible for updating the progress of a task’s execution in both TriFusion (GUI) and TriSeq (CLI). At the beginning of any given task, these objects can be initialized by providing either the Namespace object (ns) in the case of TriFusion, or the ProgressBar object (pbar), in the case of TriSeq. Along with one of these objects, the expected total of the progress should also be provided. The ns and pbar objects are updated at each iteration of a given task, and the total is used to get a measure of the progress.

Optionally, a message can be also provided for the Namespace object that will be used by TriFusion.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar

A ProgressBar object used to log the progress of TriSeq execution.

total : int

Expected total of the task’s progress.

msg : str, optional

A secondary message that appears in TriFusion’s progress dialogs.

See also

_reset_pipes, _update_pipes

Notes

The progress of any given task can be provided by either an Alignment or AlignmentList instance. Generally, the tasks follow the progress of the AlignmentList instance, unless that instance contains only one Alignment object. In that case, the progress information is piped from the Alignment instance. For that reason, and for the Namespace (ns) object only, this method first checks if the ns.sa attribute exists. If it does, it means that the parent AlignmentList instance contains only one Alignment object and, therefore, the progress information should come from here.

Examples

Start a progress counter for a task that will make 100 iterations:

self._set_pipes(ns=ns_obj, pbar=pbar_obj, total=100,
                msg="Some message")

_table_exists(table_name, cur=None)[source]¶

Checks if a table exists in the database.

Convenience method that checks if a table exists in the database. It accepts a custom Cursor object, which overrides the cur attribute.

Parameters:

table_name : str

Name of the table.

cur: sqlite3.Cursor, optional

Custom Cursor object used to query the database.

Returns:

res : list

List with the results of a query for ‘table’ type with table_name name. Is empty when the table does not exist.

Notes

This returns a list that will contain one item if the table exists in the database. If it doesn’t exist, returns an empty list. Therefore, this can be used simply like:

if self._table_exists("my_table"):
    # Do stuff
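A query of this kind can be written against sqlite's built-in sqlite_master table. The sketch below is one plausible way to do it and is not taken from the TriFusion source; the function name and the example table are assumptions.

```python
import sqlite3


def table_exists_sketch(cur, table_name):
    """Return a list with one row if table_name exists, else an empty list."""
    return cur.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name=?",
        (table_name,),
    ).fetchall()


con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE my_table (txId INTEGER, taxon TEXT, seq TEXT)")

found = table_exists_sketch(cur, "my_table")
absent = table_exists_sketch(cur, "other_table")
```

Because an empty list is falsy in Python, the returned value can be used directly in an if statement.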

static _test_range(s, min_val, max_val)[source]¶

Test if a given integer s is inside a specified range.

Wrapper for the tests that determine whether a certain value (s) is within the range provided by min_val and max_val. Either of these arguments can be None, in which case there is no lower or upper boundary, respectively.

Parameters:

s : int

Test value

min_val : int

Minimum range allowed for the test value.

max_val : int

Maximum range allowed for the test value.

Returns:

_ : bool

True if s within boundaries. False if not.
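The behaviour of _test_range with optional bounds can be sketched as follows. Treating both boundaries as inclusive is an assumption; the documentation does not state whether the limits themselves pass the test.

```python
def in_range_sketch(s, min_val=None, max_val=None):
    """Return True if s lies within the (assumed inclusive) boundaries."""
    if min_val is not None and s < min_val:
        return False
    if max_val is not None and s > max_val:
        return False
    return True
```

Passing None for both arguments accepts any value, which matches the “no boundary” case described above.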

static _update_pipes(ns=None, pbar=None, value=None, msg=None)[source]¶

Update progress indicators for both GUI and CLI task executions.

This method provides a single interface for updating the progress objects ns or pbar, which should have been initialized at the beginning of the task with the _set_pipes method.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar

A ProgressBar object used to log the progress of TriSeq execution.

value : int

Value of the current progress index

msg : str, optional

A secondary message that appears in TriFusion’s progress dialogs.

See also

_set_pipes, _reset_pipes

Notes

The progress of any given task can be provided by either an Alignment or AlignmentList instance. Generally, the tasks follow the progress of the AlignmentList instance, unless that instance contains only one Alignment object. In that case, the progress information is piped from the Alignment instance. For that reason, and for the Namespace (ns) object only, this method first checks if the ns.sa attribute exists. If it does, it means that the parent AlignmentList instance contains only one Alignment object and, therefore, the progress information should come from here.

Examples

Update the counter in an iteration of 100:

for i in range(100):
    self._update_pipes(ns=ns_obj, pbar=pbar_obj, value=i,
                       msg="Some string")

_write_fasta(output_file, **kwargs)[source]¶

Writes Alignment object into fasta output format.

Parameters:

output_file : str

Name of the output file. If using ns_pipe in TriFusion, it will prompt the user if the output file already exists.

interleave : bool, optional

Determines whether the output alignment will be in leave (False) or interleave (True) format. Not all output formats support this option.

ld_hat : bool, optional

If True, the fasta output format will include a first line compliant with the format of LD Hat and will truncate sequence names and sequence length per line accordingly.

ns_pipe : multiprocessing.Manager.Namespace, optional

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar, optional

A ProgressBar object used to log the progress of TriSeq execution.

table_suffix : string, optional

Suffix of the table from where the sequence data is fetched. The suffix is appended to Alignment.table_name.

table_name : string, optional

Name of the table from where the sequence data is fetched.

_write_gphocs(output_file, **kwargs)[source]¶

Writes Alignment object into gphocs file format.

Parameters:

ns_pipe : multiprocessing.Manager.Namespace, optional

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar, optional

A ProgressBar object used to log the progress of TriSeq execution.

table_suffix : string, optional

Suffix of the table from where the sequence data is fetched. The suffix is appended to Alignment.table_name.

table_name : string, optional

Name of the table from where the sequence data is fetched.

_write_ima2(output_file, **kwargs)[source]¶

Writes Alignment object into ima2 file format.

Parameters:

tx_space_ima2 : int, optional

Space (in characters) provided for the taxon name in ima2 format (default is 10).

cut_space_ima2 : int, optional

Set the maximum allowed character length for taxon names in ima2 format. Longer names are truncated (default is 8).

ima2_params : list

A list with the additional information required for the ima2 output format. The list should contain the following information:

(str) path to file containing the species and populations.

(str) Population tree in newick format, e.g. (0,1):2.

(str) Mutational model for all alignments.

(str) inheritance scalar.

ns_pipe : multiprocessing.Manager.Namespace, optional

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar, optional

A ProgressBar object used to log the progress of TriSeq execution.

table_suffix : string, optional

Suffix of the table from where the sequence data is fetched. The suffix is appended to Alignment.table_name.

table_name : string, optional

Name of the table from where the sequence data is fetched.

_write_mcmctree(output_file, **kwargs)[source]¶

Writes Alignment object into mcmctree file format.

Parameters:

tx_space_phy : int, optional

Space (in characters) provided for the taxon name in phylip format (default is 40).

cut_space_phy : int, optional

Set the maximum allowed character length for taxon names in phylip format. Longer names are truncated (default is 39).

ns_pipe : multiprocessing.Manager.Namespace, optional

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar, optional

A ProgressBar object used to log the progress of TriSeq execution.

table_suffix : string, optional

Suffix of the table from where the sequence data is fetched. The suffix is appended to Alignment.table_name.

table_name : string, optional

Name of the table from where the sequence data is fetched.

_write_nexus(output_file, **kwargs)[source]¶

Writes Alignment object into nexus file format.

Parameters:

output_format : list

List with the output formats to generate. Options are: {“fasta”, “phylip”, “nexus”, “stockholm”, “gphocs”, “ima2”}.

tx_space_nex : int, optional

Space (in characters) provided for the taxon name in nexus format (default is 40).

cut_space_nex : int, optional

Set the maximum allowed character length for taxon names in nexus format. Longer names are truncated (default is 39).

interleave : bool, optional

Determines whether the output alignment will be in leave (False) or interleave (True) format. Not all output formats support this option.

gap : str, optional

Symbol for alignment gaps (default is “-”).

use_charset : bool, optional

If True, partitions from the Partitions object will be written in the nexus output format (default is True).

use_nexus_models : bool, optional

If True, writes the partitions charset block in nexus format.

outgroup_list : list, optional

Specify list of taxon names that will be defined as the outgroup in the nexus output format. This may be useful for analyses with MrBayes or other software that may require outgroups.

ns_pipe : multiprocessing.Manager.Namespace, optional

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar, optional

A ProgressBar object used to log the progress of TriSeq execution.

table_suffix : string, optional

Suffix of the table from where the sequence data is fetched. The suffix is appended to Alignment.table_name.

table_name : string, optional

Name of the table from where the sequence data is fetched.

_write_phylip(output_file, **kwargs)[source]¶

Writes Alignment object into phylip format.

Parameters:

output_file : str

Name of the output file. If using ns_pipe in TriFusion, it will prompt the user if the output file already exists.

tx_space_phy : int, optional

Space (in characters) provided for the taxon name in phylip format (default is 40).

cut_space_phy : int, optional

Set the maximum allowed character length for taxon names in phylip format. Longer names are truncated (default is 39).

interleave : bool, optional

Determines whether the output alignment will be in leave (False) or interleave (True) format. Not all output formats support this option.

phy_truncate_names : bool, optional

If True, taxon names in phylip format are truncated to a maximum of 10 characters (default is False).

partition_file : bool, optional

If True, the auxiliary partitions file will be written (default is True).

model_phylip : str, optional

Substitution model for the auxiliary partition file of phylip format, compliant with RAxML.

ns_pipe : multiprocessing.Manager.Namespace, optional

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar, optional

A ProgressBar object used to log the progress of TriSeq execution.

table_suffix : string, optional

Suffix of the table from where the sequence data is fetched. The suffix is appended to Alignment.table_name.

table_name : string, optional

Name of the table from where the sequence data is fetched.

_write_stockholm(output_file, **kwargs)[source]¶

Writes Alignment object into stockholm file format.

Parameters:

ns_pipe : multiprocessing.Manager.Namespace, optional

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar, optional

A ProgressBar object used to log the progress of TriSeq execution.

table_suffix : string, optional

Suffix of the table from where the sequence data is fetched. The suffix is appended to Alignment.table_name.

table_name : string, optional

Name of the table from where the sequence data is fetched.

change_taxon_name(old_name, new_name)[source]¶

Changes the name of a particular taxon.

Parameters:

old_name : str

Original taxon name.

new_name : str

New taxon name.

code_gaps(*args, **kwargs)[source]¶

Code gap patterns as binary matrix at the end of alignment.

This method codes gaps present in the alignment in binary format, according to the method of Simmons and Ochoterena (2000), to be read by phylogenetic programs such as MrBayes. The resultant alignment, however, can only be output in the Nexus format.

Parameters:

table_in : string, optional

Name of database table containing the alignment data that is used for this operation.

table_out : string, optional

Name of database table where the final alignment will be inserted (default is “gaps”).

use_main_table : bool, optional

If True, both table_in and table_out are ignored and the main table Alignment.table_name is used as the input and output table (default is False).

ns : multiprocessing.Manager.Namespace, optional

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar, optional

A ProgressBar object used to log the progress of TriSeq execution.

See also

SetupDatabase
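The idea of appending a binary gap matrix to the alignment can be illustrated with the heavily simplified sketch below. It treats every distinct gap (identified by its start and end positions) as one binary character and appends “1” or “0” per sequence. This is not a full implementation of the Simmons and Ochoterena (2000) scheme, which also codes some gaps as inapplicable, and it is not TriFusion's code; the function name and the “-” gap symbol are assumptions.

```python
import re


def code_gaps_sketch(seqs):
    """Append a presence/absence matrix of gap events to each sequence."""
    # Each gap event is identified by its (start, end) span in the sequence
    gaps_per_seq = [{m.span() for m in re.finditer(r"-+", s)} for s in seqs]
    all_gaps = sorted(set().union(*gaps_per_seq))
    coded = []
    for seq, own_gaps in zip(seqs, gaps_per_seq):
        matrix = "".join("1" if g in own_gaps else "0" for g in all_gaps)
        coded.append(seq + matrix)
    return coded


coded_example = code_gaps_sketch(["ac--gt", "a---gt", "acgtgt"])
```

In this example two distinct gaps are found, spanning columns 1 to 3 and 2 to 3 (0-based), so two binary characters are appended to every sequence.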

collapse(*args, **kwargs)[source]¶

Collapses equal sequences into unique haplotypes.

Collapses equal sequences into unique haplotypes. This method fetches the sequences for the current alignment and creates a new database table with the collapsed haplotypes.

Parameters:

write_haplotypes : bool, optional

If True, a file mapping the taxa names to their respective haplotype name will be generated (default is True).

haplotypes_file : string, optional

Name of the file mapping taxa names to their respective haplotype. Only used when write_haplotypes is True. If it is not provided and write_haplotypes is True, the file name will be determined from Alignment.name.

dest : string, optional

Path of directory where haplotypes_file will be generated (default is “.”).

conversion_suffix : string, optional

Suffix appended to the haplotypes file. Only used when write_haplotypes is True and haplotypes_file is None.

haplotype_name : string, optional

Prefix of the haplotype string. The final haplotype name will be haplotype_name + <int> (default is “Hap”).

table_in : string, optional

Name of database table containing the alignment data that is used for this operation.

table_out : string, optional

Name of database table where the final alignment will be inserted (default is “collapsed”).

use_main_table : bool, optional

If True, both table_in and table_out are ignored and the main table Alignment.table_name is used as the input and output table (default is False).

ns : multiprocessing.Manager.Namespace, optional

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar, optional

A ProgressBar object used to log the progress of TriSeq execution.

See also

SetupDatabase

Notes

In order to collapse an alignment, all sequences would need to be compared and somehow stored until the last sequence is processed. This creates the immediate issue that, if all sequences are different, the entire alignment may have to be stored in memory. To increase memory efficiency and performance, sequences are converted into hashes, and these hashes are compared with each other instead of the actual sequences. This means that instead of comparing potentially very long strings and maintaining them in memory, only a small hash string is maintained and compared. In this way, only a fraction of the information is stored in memory and the comparisons are much faster.
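The hash-based comparison described in these notes can be sketched as follows. The choice of SHA-256, the function name and the numbering of haplotypes from 1 are assumptions for illustration; the notes do not say which hash function TriFusion uses. The sketch also returns the unique sequences as a list, whereas the real method writes them to a database table.

```python
import hashlib


def collapse_sketch(records, haplotype_name="Hap"):
    """Group identical sequences by comparing fixed-size digests.

    records is an iterable of (taxon, sequence) tuples.
    """
    label_by_digest = {}   # digest -> haplotype label
    correspondence = {}    # haplotype label -> list of taxa
    unique = []            # (label, sequence) of each first occurrence
    for taxon, seq in records:
        digest = hashlib.sha256(seq.encode("utf-8")).hexdigest()
        if digest not in label_by_digest:
            label = "{}{}".format(haplotype_name, len(label_by_digest) + 1)
            label_by_digest[digest] = label
            correspondence[label] = []
            unique.append((label, seq))
        correspondence[label_by_digest[digest]].append(taxon)
    return unique, correspondence


haps, hap_map = collapse_sketch(
    [("spa", "ACGT"), ("spb", "ACGA"), ("spc", "ACGT")]
)
```

Only the digests need to stay in the lookup dictionary, so its memory use grows with the number of unique haplotypes rather than with sequence length.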

consensus(*args, **kwargs)[source]¶

Creates a consensus sequence from the alignment.

Converts the alignment data of the current Alignment object into a single consensus sequence. The consensus_type argument determines how variation in the original alignment is handled for the generation of the consensus sequence.

The options for handling sequence variation are:

IUPAC: Converts variable sites according to the corresponding IUPAC symbols (DNA sequence type only)

Soft mask: Converts variable sites into missing data

Remove: Removes variable sites

First sequence: Uses the first sequence in the dictionary

Parameters:

consensus_type : {“IUPAC”, “Soft mask”, “Remove”, “First sequence”}

Type of variation handling. See summary above.

table_in : string, optional

Name of database table containing the alignment data that is used for this operation.

table_out : string, optional

Name of database table where the final alignment will be inserted (default is “consensus”).

use_main_table : bool, optional

If True, both table_in and table_out are ignored and the main table Alignment.table_name is used as the input and output table (default is False).

ns : multiprocessing.Manager.Namespace, optional

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar, optional

A ProgressBar object used to log the progress of TriSeq execution.

See also

SetupDatabase
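The four consensus_type options of the consensus method can be illustrated with the in-memory sketch below. It assumes lower-case DNA sequences and “n” as the missing data symbol, and it falls back to the missing symbol when a variable column has no IUPAC code in the lookup (for example, when it contains gaps). These choices and the function name are assumptions, not the documented behaviour.

```python
# Standard IUPAC nucleotide ambiguity codes
IUPAC = {
    frozenset("ag"): "r", frozenset("ct"): "y", frozenset("cg"): "s",
    frozenset("at"): "w", frozenset("gt"): "k", frozenset("ac"): "m",
    frozenset("cgt"): "b", frozenset("agt"): "d", frozenset("act"): "h",
    frozenset("acg"): "v", frozenset("acgt"): "n",
}
VALID_TYPES = ("IUPAC", "Soft mask", "Remove", "First sequence")


def consensus_sketch(seqs, consensus_type, missing="n"):
    """Build a consensus string from a list of equal-length sequences."""
    if consensus_type not in VALID_TYPES:
        raise ValueError("Unknown consensus type: {}".format(consensus_type))
    if consensus_type == "First sequence":
        return seqs[0]
    out = []
    for col in zip(*seqs):
        states = frozenset(col)
        if len(states) == 1:
            out.append(col[0])                      # invariable site
        elif consensus_type == "IUPAC":
            out.append(IUPAC.get(states, missing))  # ambiguity code
        elif consensus_type == "Soft mask":
            out.append(missing)
        # "Remove": variable sites are simply skipped
    return "".join(out)


aln = ["acgt", "aagt", "acga"]
```

For this alignment the second and fourth columns are variable, so the four options yield visibly different consensus strings.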

e = None¶: The e attribute will store any exceptions that occur during the parsing of the alignment object. It remains None unless something wrong happens.

filter_codon_positions(*args, **kwargs)[source]¶

Filter alignment according to codon positions.

Filters codon positions from an Alignment object with DNA sequence type. The position_list argument determines which codon positions will be stored. It should be a three-element list, corresponding to the three codon positions, and the positions that are saved are the ones with a True value.

Parameters:

position_list : list

List of three bool elements that correspond to each codon position. For example, [True, True, True] will save all positions, while [True, True, False] will exclude the third codon position.

table_in : string, optional

Name of database table containing the alignment data that is used for this operation.

table_out : string, optional

Name of database table where the final alignment will be inserted (default is “filter”).

use_main_table : bool, optional

If True, both table_in and table_out are ignored and the main table Alignment.table_name is used as the input and output table (default is False).

ns : multiprocessing.Manager.Namespace, optional

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar, optional

A ProgressBar object used to log the progress of TriSeq execution.

See also

SetupDatabase
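The way position_list selects codon positions in filter_codon_positions can be sketched for a single sequence as follows. The sketch assumes that the reading frame starts at the first alignment column; the function name is hypothetical.

```python
def filter_codon_positions_sketch(seq, position_list):
    """Keep only the characters whose codon position is flagged True."""
    return "".join(
        char for idx, char in enumerate(seq) if position_list[idx % 3]
    )
```

For instance, applying [True, True, False] to “acgtta” drops the third base of each codon and leaves “actt”.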

filter_informative_sites(min_val, max_val, table_in=None, ns=None, pbar=None)[source]¶

Checks if the variable sites from alignment are within range.

Similar to filter_segregating_sites method, but only considers informative sites (variable sites present in more than 2 taxa).

Parameters:

min_val : int

Minimum number of informative sites for the alignment to pass. Can be None, in which case there is no lower bound.

max_val : int

Maximum number of informative sites for the alignment to pass. Can be None, in which case there is no upper bound.

table_in : string, optional

Name of database table containing the alignment data that is used for this operation.

ns : multiprocessing.Manager.Namespace, optional

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar, optional

A ProgressBar object used to log the progress of TriSeq execution.

Returns:

_ : bool

True if the number of informative sites is within the specified range. False if not.

See also

filter_segregating_sites

filter_missing_data(*args, **kwargs)[source]¶

Filters alignment according to missing data.

Filters gaps and true missing data from the alignment using tolerance thresholds for each type of missing data. Both thresholds are maximum percentages of sites in an alignment column containing the type of missing data. If gap_threshold=50, for example, alignment columns with more than 50% of sites with gaps are removed.

Parameters:

gap_threshold : int

Integer between 0 and 100 defining the percentage above which a column with that gap percentage is removed.

missing_threshold : int

Integer between 0 and 100. Columns whose combined percentage of gaps and missing data exceeds this value are removed.

table_in : string, optional

Name of database table containing the alignment data that is used for this operation.

table_out : string, optional

Name of database table where the final alignment will be inserted (default is “filter”).

use_main_table : bool, optional

If True, both table_in and table_out are ignored and the main table Alignment.table_name is used as the input and output table (default is False).

ns : multiprocessing.Manager.Namespace, optional

A Namespace object used to communicate with the main thread in TriFusion.
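The column filtering described above can be illustrated with a minimal, database-free sketch. The function name `filter_missing_columns`, the in-memory list of sequences and the lower-case `n` missing symbol are assumptions made for illustration; the actual method reads from and writes to database tables.

```python
def filter_missing_columns(sequences, gap_threshold, missing_threshold,
                           gap="-", missing="n"):
    """Drop alignment columns that exceed either missing-data threshold.

    Hypothetical helper: operates on a list of equal-length strings
    instead of a database table.
    """
    n_taxa = len(sequences)
    kept = []
    for column in zip(*sequences):
        n_gaps = column.count(gap)
        n_missing = column.count(missing)
        gap_pct = 100.0 * n_gaps / n_taxa
        # The second threshold applies to gaps and missing data combined
        missing_pct = 100.0 * (n_gaps + n_missing) / n_taxa
        if gap_pct > gap_threshold or missing_pct > missing_threshold:
            continue
        kept.append(column)
    # Transpose the surviving columns back into one string per taxon
    return ["".join(column[i] for column in kept) for i in range(n_taxa)]


seqs = ["a-cn", "a-cg", "atcn", "a-gn"]
filtered = filter_missing_columns(seqs, gap_threshold=50, missing_threshold=50)
```

In this example the second column has 75% gaps and the fourth has 75% missing data, so both are removed and two columns remain.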

filter_segregating_sites(min_val, max_val, table_in=None, ns=None, pbar=None)[source]¶

Checks if the segregating sites from alignment are within range.

Evaluates the number of segregating sites of the current alignment and returns True if it falls between the range specified by the min_val and max_val arguments.

Parameters:

min_val : int

Minimum number of segregating sites for the alignment to pass. Can be None, in which case there is no lower bound.

max_val : int

Maximum number of segregating sites for the alignment to pass. Can be None, in which case there is no upper bound.

table_in : string, optional

Name of database table containing the alignment data that is used for this operation.

ns : multiprocessing.Manager.Namespace, optional

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar, optional

A ProgressBar object used to log the progress of TriSeq execution.

Returns:

_ : bool

True if the number of segregating sites is within the specified range. False if not.

See also

filter_informative_sites
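As a rough illustration of the range check shared by filter_segregating_sites and filter_informative_sites, the sketch below counts segregating sites in a small in-memory alignment and tests the result against optional bounds. Both function names are hypothetical, and two details are assumptions: the bounds are treated as inclusive, and gaps and missing data are not counted as character states.

```python
def count_segregating_sites(sequences, ignore=("-", "n")):
    """Count columns with more than one distinct character state.

    Assumption: gap and missing symbols are not counted as states.
    """
    count = 0
    for column in zip(*sequences):
        states = {char.lower() for char in column} - set(ignore)
        if len(states) > 1:
            count += 1
    return count


def within_range(value, min_val=None, max_val=None):
    """Return True if value is within the bounds; None means no bound.

    Assumption: both bounds are inclusive.
    """
    if min_val is not None and value < min_val:
        return False
    if max_val is not None and value > max_val:
        return False
    return True


seqs = ["ACGT-A", "ACGTTA", "ATGTTC"]
n_sites = count_segregating_sites(seqs)
passes = within_range(n_sites, min_val=1, max_val=None)
```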

get_sequence(*args, **kwargs)[source]¶

Returns the sequence string for a given taxon.

Returns the sequence string of the corresponding taxon. If the taxon does not exist in the table, raises a KeyError. The sequence is retrieved from a database table specified either by the table_name or table_suffix arguments. table_name will always take precedence over table_suffix if both are provided. If neither is provided, the default Alignment.table_name is used. If the table name provided by either table_name or table_suffix is invalid, the default Alignment.table_name is also used.

Parameters:

taxon : str

Name of the taxon.

table_suffix : string

Suffix of the table from where the sequence data is fetched. The suffix is appended to Alignment.table_name.

table_name : string

Name of the table from where the sequence data is fetched.

table : string

This argument will be set up automatically by the SetupInTable decorator. Do not use directly.

Returns:

seq : str

Sequence string.

Raises:

KeyError

If the taxon does not exist in the specified table.

Notes

Ignores data associated with taxa present in the shelved_taxa attribute.
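The table selection rules described for this method (and shared by the generators below) can be sketched as a small standalone function. This is only an illustration: `resolve_table` is a hypothetical name, "invalid" is interpreted here as "not among the existing tables", and the suffix is assumed to be joined to the main table name by plain concatenation.

```python
def resolve_table(main_table, existing_tables,
                  table_name=None, table_suffix=None):
    """Pick the table to query, falling back to the main table.

    Hypothetical stand-in for the logic handled by SetupInTable.
    """
    if table_name is not None:
        # table_name takes precedence over table_suffix
        candidate = table_name
    elif table_suffix is not None:
        # Assumption: the suffix is concatenated directly to the main name
        candidate = main_table + table_suffix
    else:
        candidate = main_table
    # Assumption: an invalid table is one that does not exist
    return candidate if candidate in existing_tables else main_table


tables = {"aln", "aln_filter"}
default = resolve_table("aln", tables)
by_suffix = resolve_table("aln", tables, table_suffix="_filter")
by_name = resolve_table("aln", tables, table_name="aln_filter",
                        table_suffix="_other")
fallback = resolve_table("aln", tables, table_name="does_not_exist")
```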

input_format = None¶: Format of the alignment file.

interleave_data = None¶: Attribute that is set to True when interleave data has been created for the alignment data. Building the interleave matrix is somewhat costly, so when the user requests it, it is built once and stored in the database, and subsequent requests reuse that table.

iter_alignment(*args, **kwargs)[source]¶

Generator for (taxon, sequence) tuples.

Generator that yields (taxon, sequence) tuples from the database. Sequence data is retrieved from a database table specified either by the table_name or table_suffix arguments. table_name will always take precedence over table_suffix if both are provided. If neither is provided, the default Alignment.table_name is used. If the table name provided by either table_name or table_suffix is invalid, the default Alignment.table_name is also used.

Parameters:

table_suffix : string

Suffix of the table from where the sequence data is fetched. The suffix is appended to Alignment.table_name.

table_name : string

Name of the table from where the sequence data is fetched.

table : string

This argument will be set up automatically by the SetupInTable decorator. Do not use directly.

Yields:

tx : string

Taxon name.

seq : string

Sequence string.

See also

SetupInTable

Notes

All generators ignore data associated with taxa present in the shelved_taxa attribute.

iter_columns(*args, **kwargs)[source]¶

Generator over alignment columns.

Generator that yields the alignment columns in a list of characters. Sequence data is retrieved from a database table specified either by the table_name or table_suffix arguments. table_name will always take precedence over table_suffix if both are provided. If neither is provided, the default Alignment.table_name is used. If the table name provided by either table_name or table_suffix is invalid, the default Alignment.table_name is also used.

The table variable is set up automatically by the SetupInTable decorator according to table_suffix and table_name, and should not be used directly (values provided when calling the method are ignored).

Parameters:

table_suffix : string

Suffix of the table from where the sequence data is fetched. The suffix is appended to Alignment.table_name.

table_name : string

Name of the table from where the sequence data is fetched.

table : string

This argument will be set up automatically by the SetupInTable decorator. Do not use directly.

Yields:

i : list

List with alignment column characters as elements ([“A”, “A”, “C”, “A”, “A”]).

See also

SetupInTable

Notes

To prevent the query from loading the entire alignment data into memory, two techniques are employed. First, the iteration over the query results is lazy, using itertools.izip. Second, a range iteration is performed to ensure that only blocks of 100k alignment sites are fetched at any given time. Several tests were performed to ensure that this block size provided a good trade-off between RAM usage and speed.

All generators ignore data associated with taxa present in the shelved_taxa attribute.
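The block-wise strategy described in the Notes can be sketched with the standard sqlite3 module. This is a minimal illustration rather than the library's implementation: the table schema (`txId`, `taxon`, `seq`), the function name and the use of SQLite's `substr` to fetch each block are assumptions. The built-in `zip` of Python 3 is used in place of `itertools.izip`, since it is already lazy.

```python
import sqlite3


def iter_columns_sketch(cur, table, seq_len, block=100000):
    """Yield alignment columns, fetching `block` sites per query."""
    # SQLite's substr() is 1-indexed
    for start in range(1, seq_len + 1, block):
        rows = cur.execute(
            'SELECT substr(seq, ?, ?) FROM "{}" ORDER BY txId'.format(table),
            (start, block))
        # Only one block per taxon is held in memory; zip then walks
        # lazily across the columns of that block
        for column in zip(*(row[0] for row in rows)):
            yield list(column)


con = sqlite3.connect(":memory:")
cur = con.cursor()
# Assumed schema, for illustration only
cur.execute("CREATE TABLE aln (txId INTEGER, taxon TEXT, seq TEXT)")
cur.executemany("INSERT INTO aln VALUES (?, ?, ?)",
                [(0, "spa", "ACGT"), (1, "spb", "ACTT"), (2, "spc", "AAGT")])

# A tiny block size is used here so that more than one query is issued
columns = list(iter_columns_sketch(cur, "aln", seq_len=4, block=3))
```

A unique-characters variant, in the spirit of iter_columns_uniq, could yield `list(set(column))` instead of `list(column)`.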

iter_columns_uniq(*args, **kwargs)[source]¶

Generator over unique characters of an alignment column.

Generator that yields unique elements of alignment columns in a list. Sequence data is retrieved from a database table specified either by the table_name or table_suffix arguments. table_name will always take precedence over table_suffix if both are provided. If neither is provided, the default Alignment.table_name is used. If the table name provided by either table_name or table_suffix is invalid, the default Alignment.table_name is also used.

The table variable is set up automatically by the SetupInTable decorator according to table_suffix and table_name, and should not be used directly (values provided when calling the method are ignored).

Parameters:

table_suffix : string

Suffix of the table from where the sequence data is fetched. The suffix is appended to Alignment.table_name.

table_name : string

Name of the table from where the sequence data is fetched.

table : string

This argument will be set up automatically by the SetupInTable decorator. Do not use directly.

Yields:

i : list

List with alignment column characters as elements ([“A”, “C”]).

See also

SetupInTable

Notes

To prevent the query from loading the entire alignment data into memory, two techniques are employed. First, the iteration over the query results is lazy, using itertools.izip. Second, a range iteration is performed to ensure that only blocks of 100k alignment sites are fetched at any given time. Several tests were performed to ensure that this block size provided a good trade-off between RAM usage and speed.

All generators ignore data associated with taxa present in the shelved_taxa attribute.

iter_sequences(*args, **kwargs)[source]¶

Generator over sequence strings in the alignment.

Generator for sequence data of the Alignment object. Sequence data is retrieved from a database table specified either by the table_name or table_suffix arguments. table_name will always take precedence over table_suffix if both are provided. If neither is provided, the default Alignment.table_name is used. If the table name provided by either table_name or table_suffix is invalid, the default Alignment.table_name is also used.

Parameters:

table_suffix : string

Suffix of the table from where the sequence data is fetched. The suffix is appended to Alignment.table_name.

table_name : string

Name of the table from where the sequence data is fetched.

table : string

This argument will be set up automatically by the SetupInTable decorator. Do not use directly.

Yields:

seq : string

Sequence string for a given taxon.

See also

SetupInTable

Notes

All generators ignore data associated with taxa present in the shelved_taxa attribute.

locus_length = None¶: The length of the alignment object. Even if the current alignment object is partitioned, this will return the length of the entire alignment.

name = None¶: Attribute with basename of alignment file.

partitions = None¶: Initializing a Partitions instance for the current alignment. By default, the Partitions instance will assume that the whole Alignment object is one single partition. However, if the current alignment is the result of a concatenation, or if a partitions file is provided, the partitions argument can be used. Partitions can be later changed using the set_partitions method. Substitution model objects are associated with each partition and they default to None.

path = None¶: Attribute with full path to alignment file

read_alignment(size_check=True)[source]¶

Main alignment parser method.

This is the main alignment parsing method that is called when the Alignment object is instantiated with a file path as the argument. Given the alignment format automatically detected in __init__, it calls the specific method that parses that alignment format. After the execution of this method, all attributes of the class will be set and the full range of methods can be applied.

Parameters:

size_check : bool, optional

If True, perform a check for sequence size consistency across taxa (default is True).

remove_alignment()[source]¶: Removes alignment data from the database.

remove_taxa(taxa_list_file, mode='remove')[source]¶

Removes taxa from the Alignment object.

Removes taxa based on the taxa_list_file argument from the Alignment object.

This method supports either a list or a path to a CSV file as the taxa_list_file argument. In the case of a CSV path, the file is parsed into a list. The CSV file should be a simple text file containing taxon names in each line.

This method also supports two removal modes.

remove: removes the specified taxa
inverse: removes all but the specified taxa

Parameters:

taxa_list_file : list or string

List containing taxa names or string with path to a CSV file containing the taxa names.

mode : {“remove”, “inverse”}, optional

The removal mode (default is remove).
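The two removal modes can be sketched on plain Python lists. The helper names `parse_taxa_lines` and `select_remaining_taxa` are hypothetical and only illustrate the selection logic; the actual method also removes the corresponding data from the database.

```python
def parse_taxa_lines(lines):
    """Parse an iterable of text lines into a list of taxon names."""
    return [line.strip() for line in lines if line.strip()]


def select_remaining_taxa(taxa_list, selection, mode="remove"):
    """Return the taxa that remain after applying the removal mode."""
    selected = set(selection)
    if mode == "remove":
        # Drop the specified taxa
        return [tx for tx in taxa_list if tx not in selected]
    if mode == "inverse":
        # Drop everything except the specified taxa
        return [tx for tx in taxa_list if tx in selected]
    raise ValueError("mode must be 'remove' or 'inverse'")


taxa = ["spa", "spb", "spc", "spd"]
after_remove = select_remaining_taxa(taxa, ["spb"])
after_inverse = select_remaining_taxa(taxa, ["spb", "spd"], mode="inverse")
from_file = parse_taxa_lines(["spa\n", "\n", " spb \n"])
```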

restriction_range = None¶: This option is only relevant when gaps are coded. This will store a string with the range of the restriction-type data that will encode gaps and will only be used when nexus is in the output format.

reverse_concatenate(table_in='', db_con=None, ns=None, pbar=None)[source]¶

Reverses the alignment according to the partitions attribute.

This method splits the current alignment according to the partitions set in the partitions attribute. Returns an AlignmentList object where each partition is an Alignment object.

Parameters:

table_in : string

Name of database table containing the alignment data that is used for this operation.

db_con : sqlite3.Connection

Database connection object that will be provided to each new Alignment object.

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar

A ProgressBar object used to log the progress of TriSeq execution.
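Conceptually, reversing a concatenation slices every sequence by the range of each partition. The sketch below illustrates this on a plain dictionary; the function name, the `{name: (start, end)}` layout and the 0-based inclusive ranges are assumptions, and do not reflect the internal structure of the Partitions class.

```python
def split_by_partitions(alignment, partitions):
    """Split a {taxon: sequence} mapping into one mapping per partition.

    Assumption: partitions maps a name to a 0-based, inclusive
    (start, end) range.
    """
    result = {}
    for name, (start, end) in partitions.items():
        result[name] = {taxon: seq[start:end + 1]
                        for taxon, seq in alignment.items()}
    return result


concatenated = {"spa": "ACGTTT", "spb": "ACGAAA"}
ranges = {"gene1": (0, 2), "gene2": (3, 5)}
per_gene = split_by_partitions(concatenated, ranges)
```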

set_partitions(partitions)[source]¶

Updates the partitions attribute.

Parameters:

partitions : trifusion.process.data.Partitions

Partitions object.

shelve_taxa(lst)[source]¶

Shelves taxa from Alignment methods.

The taxa provided in the lst argument will be ignored in subsequent operations. In practice, taxa from the taxa_list attribute that are present in lst will be moved to the shelved_taxa attribute and ignored in all methods of the class. To revert all taxa to the taxa_list attribute, simply call this method with an empty list.

Parameters:

lst : list

List with taxa names that should be ignored.
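The shelving behaviour can be sketched with a small class that keeps only the two attributes involved. The class name `TaxaShelf` and the private `_all_taxa` list are hypothetical; the sketch only shows that each call redefines the shelved set, so an empty list restores every taxon.

```python
class TaxaShelf(object):
    """Minimal stand-in holding taxa_list and shelved_taxa."""

    def __init__(self, taxa):
        self._all_taxa = list(taxa)  # hypothetical: original order
        self.taxa_list = list(taxa)
        self.shelved_taxa = []

    def shelve_taxa(self, lst):
        # Each call starts from the full set of taxa, so passing an
        # empty list moves everything back to taxa_list
        shelved = set(lst)
        self.shelved_taxa = [tx for tx in self._all_taxa if tx in shelved]
        self.taxa_list = [tx for tx in self._all_taxa if tx not in shelved]


shelf = TaxaShelf(["spa", "spb", "spc"])
shelf.shelve_taxa(["spb"])
active_after_shelving = list(shelf.taxa_list)
shelf.shelve_taxa([])
active_after_reset = list(shelf.taxa_list)
```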

shelved_taxa = None¶: Attribute that will store shelved taxa. When retrieving taxa from the database, this list will be checked and present taxa will be ignored.

sname = None¶: Attribute with basename of alignment file without extension.

table_name = None¶: Creates a database table for the current alignment. This will be the link attribute between the sqlite database and the remaining methods that require information on the sequence data. This also removes any non-alphanumeric characters the table name might have and ensures that it starts with an alphabetic character to avoid a database error.
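One possible way to derive such a table name is sketched below. The function name and the `t` prefix used when the cleaned name does not start with a letter are assumptions; the source only states that non-alphanumeric characters are removed and that the name must start with an alphabetic character.

```python
import re


def sanitize_table_name(name, prefix="t"):
    """Strip non-alphanumeric characters and force a leading letter."""
    cleaned = re.sub(r"[^A-Za-z0-9]", "", name)
    # Assumption: a fixed prefix is added when the first character
    # is not a letter (or when nothing is left)
    if not cleaned or not cleaned[0].isalpha():
        cleaned = prefix + cleaned
    return cleaned


plain = sanitize_table_name("my_aln.phy")
numeric_start = sanitize_table_name("1_gene-A.fas")
```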

tables = None¶: Lists the currently active tables for the Alignment object. The ‘master’ table is always present and represents the original alignment. Additional tables may be added as needed, and then dropped when no longer necessary. When all tables, except the master, need to be removed, this attribute can be used to quickly drop all derived tables.

taxa_idx = None¶: Attribute that will store the index of the taxa in the sql database. This index is stored in a dictionary instead of being retrieved from the index position of the taxa_list list because taxa may be removed (which would break the link between the two indexes).

taxa_list = None¶: Attribute that will store the taxa names.

static write_loci_correspondence(hap_dict, output_file, dest='./')[source]¶

Writes the file mapping taxa to unique haplotypes for collapse.

Parameters:

hap_dict : dict

Dictionary mapping the haplotype (key) to a list of taxa (value) sharing the same haplotype.

output_file : str

Name of the haplotype correspondence file.

dest : str, optional

Path to directory where the output_file will be written.

write_to_file(output_format, output_file, **kwargs)[source]¶

Writes alignment data into an output file.

Writes the alignment object into a specified output file, automatically adding the extension, according to the specified output formats.

This function supports the writing of both converted (no partitions) and concatenated (partitioned) files. The choice between these modes is determined by the Partitions object associated with the Alignment object. If it contains multiple partitions, it will produce a concatenated alignment and the auxiliary partition files where necessary. Otherwise it will treat the alignment as a single partition.

Parameters:

output_format : list

List with the output formats to generate. Options are: {“fasta”, “phylip”, “nexus”, “stockholm”, “gphocs”, “ima2”}.

output_file : str

Name of the output file. If using ns_pipe in TriFusion, it will prompt the user if the output file already exists.

tx_space_nex : int, optional

Space (in characters) provided for the taxon name in nexus format (default is 40).

tx_space_phy : int, optional

Space (in characters) provided for the taxon name in phylip format (default is 40).

tx_space_ima2 : int, optional

Space (in characters) provided for the taxon name in ima2 format (default is 10).

cut_space_nex : int, optional

Set the maximum allowed character length for taxon names in nexus format. Longer names are truncated (default is 39).

cut_space_phy : int, optional

Set the maximum allowed character length for taxon names in phylip format. Longer names are truncated (default is 39).

cut_space_ima2 : int, optional

Set the maximum allowed character length for taxon names in ima2 format. Longer names are truncated (default is 8).

interleave : bool, optional

Determines whether the output alignment will be in leave (False) or interleave (True) format. Not all output formats support this option.

gap : str, optional

Symbol for alignment gaps (default is ‘-’).

model_phylip : str, optional

Substitution model for the auxiliary partition file of phylip format, compliant with RAxML.

outgroup_list : list, optional

Specify list of taxon names that will be defined as the outgroup in the nexus output format. This may be useful for analyses with MrBayes or other software that may require outgroups.

ima2_params : list, optional

A list with the additional information required for the ima2 output format. The list should contain the following information:

(str) Path to file containing the species and populations.

(str) Population tree in newick format, e.g. (0,1):2.

(str) Mutational model for all alignments.

(str) Inheritance scalar.

use_charset : bool, optional

If True, partitions from the Partitions object will be written in the nexus output format (default is True).

partition_file : bool, optional

If True, the auxiliary partitions file will be written (default is True).

output_dir : str, optional

If provided, the output file will be written to the specified directory.

phy_truncate_names : bool, optional

If True, taxon names in phylip format are truncated to a maximum of 10 characters (default is False).

ld_hat : bool, optional

If True, the fasta output format will include a first line compliant with the format of LD Hat and will truncate sequence names and sequence length per line accordingly.

use_nexus_models : bool, optional

If True, writes the partitions charset block in nexus format.

ns_pipe : multiprocessing.Manager.Namespace, optional

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar, optional

A ProgressBar object used to log the progress of TriSeq execution.

table_suffix : string, optional

Suffix of the table from where the sequence data is fetched. The suffix is appended to Alignment.table_name.

table_name : string, optional

Name of the table from where the sequence data is fetched.
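The interplay between the tx_space_* and cut_space_* options can be illustrated with a short sketch that truncates and pads taxon names before writing a sequential phylip-like text. The helper names are hypothetical and the exact layout produced by write_to_file may differ; in particular, it is an assumption that the name is first truncated to cut_space characters and then left-justified to tx_space characters.

```python
def format_taxon_name(name, tx_space=40, cut_space=39):
    """Truncate the name to cut_space and pad it to tx_space characters."""
    return name[:cut_space].ljust(tx_space)


def phylip_text(alignment, tx_space=40, cut_space=39):
    """Build a sequential phylip-like string from {taxon: sequence}."""
    n_sites = len(next(iter(alignment.values())))
    lines = ["{} {}".format(len(alignment), n_sites)]
    for taxon, seq in alignment.items():
        lines.append(format_taxon_name(taxon, tx_space, cut_space) + seq)
    return "\n".join(lines) + "\n"


aln = {"spa": "ACGT", "a_very_long_taxon_name": "ACGA"}
text = phylip_text(aln, tx_space=10, cut_space=9)
```

With the defaults shown in the parameter list (40 and 39), the truncated name is always followed by at least one space before the sequence starts.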

exception trifusion.process.sequence.AlignmentException[source]¶

Bases: exceptions.Exception

Generic Alignment object exception.

class trifusion.process.sequence.AlignmentList(alignment_list, sql_db=None, db_cur=None, db_con=None, pbar=None)[source]¶

Bases: trifusion.process.base.Base

Main interface for groups of Alignment objects.

The AlignmentList class can be seen as a group of Alignment objects that is treated as a single cohesive data set. It is the main class used to perform alignment operations in the TriFusion and TriSeq programs, even when there is only a single alignment.

An instance can be created by simply providing a list of paths to alignment files, which can be empty (and alignments added later). It is recommended that the path to the sqlite database be specified via the sql_db argument, but this is optional.

Individual alignment files can be provided when instantiating the class or added later, and each is processed and stored as an Alignment object. The AlignmentList object provides attributes that are relative to the global data set. For instance, each Alignment object may have its own unique list of taxa, but the corresponding attribute in the AlignmentList object contains the list of all taxa that are found in the Alignment object list.

Several methods of the Alignment class are also present in this class with the same usage (e.g. collapse). These are basically wrappers that apply the method to all Alignment objects and may have some modifications adapted to multiple alignments.

Other methods are exclusive and designed to deal with multiple alignments, such as concatenation (concatenate), filter by taxa (filter_by_taxa), etc.

Plotting methods are exclusive to AlignmentList, even the ones that are focused on single alignments. In this case, the Alignment object can be specified and processed individually directly from this class.

Parameters:

alignment_list : list

List with paths to alignment files. Can be empty.

sql_db : str, optional

Path where the sqlite3 database will be created.

db_cur : sqlite3.Cursor, optional

Provide a database Cursor object in conjunction with the Connection object (db_con) to connect to an existing database.

db_con : sqlite3.Connection, optional

Provide a database Connection object in conjunction with the Cursor object (db_cur) to connect to an existing database.

pbar : ProgressBar, optional

A ProgressBar object used to log the progress of TriSeq execution.

Attributes

con	(sqlite3.Connection) Connection object of the sqlite database.
cur	(sqlite3.Cursor) Cursor object of the sqlite database.
sql_path	(str) Path to sqlite3 database file.
alignments	(collections.OrderedDict) Stores the ‘active’ Alignment objects as values, with the corresponding key being Alignment.path.
shelve_alignments	(collections.OrderedDict) Stores the ‘inactive’ or ‘shelved’ Alignment objects. AlignmentList methods will operate only on the alignments attribute, unless explicitly stated otherwise. The key:value is the same as in the alignments attribute.
bad_alignments	(list) Stores the Alignment.name attribute of badly formatted alignments.
duplicate_alignments	(list) Stores the Alignment.name attribute of duplicated alignment paths.
non_alignments	(list) Stores the Alignment.name attribute of sequence sets of unequal length.
taxa_names	(list) Stores the name of all taxa across the Alignment object list.
shelved_taxa	(list) Stores the ‘inactive’ or ‘shelved’ taxa. AlignmentList methods will operate only on the taxa_names attribute, unless explicitly stated otherwise.
path_list	(list) List of Alignment.path for all Alignment objects.
filtered_alignments	(collections.OrderedDict) Ordered dictionary of four key:values used to count the number of Alignment objects that were filtered by four operations. The keys are {“By minimum taxa”, “By taxa”, “By variable sites”, “By informative sites”}.
sequence_code	(tuple) Contains information on (<sequence type>, <missing data symbol>), e.g. (“Protein”, “x”).
gap_symbol	(str) Symbol used to denote gaps in the alignment (default is ‘-’).
summary_stats	(dict) Dictionary that stores several summary statistics that are calculated once the get_summary_stats method is executed.
summary_gene_table	(pandas.DataFrame) DataFrame containing the summary statistics for each Alignment object. Also populated when the get_summary_stats method is executed.
active_tables	(list) List of active database tables associated with the AlignmentList instance.
partitions	(trifusion.process.data.Partitions) Partitions object that refers to the total AlignmentList.

Methods

`add_alignment_files`(file_name_list[, pbar, ns])	Adds a list of alignment files to the current AlignmentList.
`add_alignments`(alignment_obj_list[, ...])	Add a list of Alignment objects to the current AlignmentList.
`allele_frequency_spectrum`(*args, **kwargs)	Creates data for the allele frequency spectrum.
`allele_frequency_spectrum_gene`(*args, **kwargs)	Creates data for the allele frequency spectrum of an Alignment.
`aln_names`()	Returns list of basenames of alignments file paths.
`autofinder`(reference_file)	Autodetects format, missing data symbol and sequence type.
`average_seqsize`(*args, **kwargs)	Creates data for the average sequence size for the entire data set.
`average_seqsize_per_species`(*args, **kwargs)	Creates data for average sequence size per taxon.
`change_taxon_name`(old_name, new_name)	Changes the name of a taxon.
`characters_proportion`(*args, **kwargs)	Creates data for the proportion of nt/aa for the data set.
`characters_proportion_per_species`(*args, ...)	Creates data for the proportion of nt/aa per species.
`clear_alignments`()	Clears all attributes and data from the AlignmentList object.
`code_gaps`(*args, **kwargs)	Code gaps in each Alignment object.
`collapse`(*args, **kwargs)	Collapses equal sequences for each Alignment object.
`concatenate`([table_in, ns, pbar])	Concatenates alignments into a single Alignment object.
`consensus`(*args, **kwargs)	Creates consensus sequences for each alignment.
`cumulative_missing_genes`(*args, **kwargs)	Creates data for a distribution of the maximum number of genes available for consecutive thresholds of missing data.
`duplicate_taxa`(taxa_list)	Identifies duplicate items in a list.
`filter_by_taxa`(taxa_list, filter_mode[, ns, ...])	Filters alignments if they contain or exclude certain taxa.
`filter_codon_positions`(*args, **kwargs)	Filters codon positions in each Alignment object.
`filter_informative_sites`(*args, **kwargs)	Filters Alignment objects according to informative sites number.
`filter_min_taxa`(min_taxa[, ns, pbar])	Filters alignments by minimum taxa proportion.
`filter_missing_data`(*args, **kwargs)	Filters missing data in each Alignment object.
`filter_segregating_sites`(*args, **kwargs)	Filters Alignment objects according to segregating sites number.
`format_list`()	Returns list of unique sequence types from Alignment objects.
`gene_occupancy`(*args, **kwargs)	Create data for gene occupancy plot.
`get_gene_table_stats`([active_alignments, ...])	Returns summary statistics for each Alignment.
`get_loci_taxa`(loci_file)	Get the list of taxa from a .loci file.
`get_summary_stats`([active_alignments, ns])	Calculates summary statistics for the ‘active’ alignments.
`get_tables`()	Return list with table_name of all Alignment objects.
`guess_code`(sequence)	Guess the sequence type, i.e.
`iter_alignment_files`()	Returns an iterable on Alignment.path of each alignment.
`length_polymorphism_correlation`(*args, **kwargs)	Creates data for correlation between alignment length and informative sites.
`missing_data_distribution`(*args, **kwargs)	Create data for overall distribution of missing data plot.
`missing_data_per_species`(*args, **kwargs)	Creates data for missing data proportion per species plot.
`missing_genes_average`(*args, **kwargs)	Creates histogram data with average missing taxa.
`missing_genes_per_species`(*args, **kwargs)	Creates data with distribution of missing genes per species plot.
`outlier_missing_data`(*args, **kwargs)	Create data for missing data alignment outliers.
`outlier_missing_data_sp`(*args, **kwargs)	Create data for missing data taxa outliers.
`outlier_segregating`(*args, **kwargs)	Creates data for segregating site alignment outliers.
`outlier_segregating_sp`(*args, **kwargs)	Create data for segregating sites taxa outliers.
`outlier_sequence_size`(*args, **kwargs)	Creates data for sequence size alignment outliers.
`outlier_sequence_size_sp`(*args, **kwargs)	Create data for sequence size taxa outliers.
`read_basic_csv`(file_handle)	Reads a basic CSV into a list.
`remove_file`(filename_list)	Removes alignments.
`remove_tables`([preserve_tables, trash_tables])	Drops tables from the database.
`remove_taxa`(taxa_list[, mode])	Removes the specified taxa.
`resume_database`()	Reconnects to the sqlite database.
`retrieve_alignment`(name)	Return Alignment object with a given name.
`reverse_concatenate`([ns])	Reverse a concatenated file according to the partitions.
`rm_illegal`(taxon_string)	Removes illegal characters from taxon name.
`select_by_taxa`(taxa_list[, mode])	Selects a list of Alignment objects by taxa.
`sequence_segregation`(*args, **kwargs)	Creates data for overall distribution of segregating sites.
`sequence_segregation_gene`(*args, **kwargs)	Create data for a sliding window analysis of segregating sites.
`sequence_segregation_per_species`(*args, **kwargs)	Creates data for a triangular matrix of sequence segregation for pairs of taxa.
`sequence_similarity`(*args, **kwargs)	Creates data for average sequence similarity plot.
`sequence_similarity_gene`(*args, **kwargs)	Creates data for sliding window sequence similarity for alignment.
`sequence_similarity_per_species`(*args, **kwargs)	Creates data for sequence similarity per species pair plot.
`set_database_connections`(cur, con)	Provides Connection and Cursor to Alignment objects.
`set_partition_from_alignment`(alignment_obj)	Updates partitions object with the provided Alignment object.
`taxa_distribution`(*args, **kwargs)	Creates data for a distribution of taxa frequency across alignments.
`update_active_alignment`(aln_name, direction)	Updates the ‘active’ status of a single alignment.
`update_active_alignments`([aln_list, all_files])	Sets the active alignments.
`update_taxa_names`([taxa_list, all_taxa])	Sets the active taxa.
`write_taxa_to_file`([file_name])	Writes the taxa in taxa_list to a file.
`write_to_file`(output_format[, ...])	Writes Alignment objects into files.

_create_table(table_name, index=None, cur=None)[source]¶

Creates a new table in the database.

Convenience method that creates a new table in the sqlite database. It accepts a custom Cursor object, which overrides the cur attribute. Optionally, an index can also be specified.

Parameters:

table_name : str

Name of the table to be created.

index : list, optional

List with [<name of index>, <columns to be indexed>] (e.g., [“concindex”, “txId”]).

cur : sqlite3.Cursor, optional

Custom Cursor object used to query the database.
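A minimal sketch of what such a convenience method might do with the standard sqlite3 module is shown below. The function name and the column layout (`txId`, `taxon`, `seq`) are assumptions made for illustration; only the `[<name of index>, <columns to be indexed>]` form of the index argument comes from the description above.

```python
import sqlite3


def create_table_sketch(cur, table_name, index=None):
    """Create a sequence table and, optionally, an index on it."""
    # Assumed schema, for illustration only
    cur.execute(
        'CREATE TABLE "{}" (txId INTEGER, taxon TEXT, seq TEXT)'.format(
            table_name))
    if index:
        index_name, columns = index
        cur.execute('CREATE INDEX "{}" ON "{}" ({})'.format(
            index_name, table_name, columns))


con = sqlite3.connect(":memory:")
cur = con.cursor()
create_table_sketch(cur, "concatenation", index=["concindex", "txId"])
indexes = cur.execute(
    "SELECT name FROM sqlite_master WHERE type='index'").fetchall()
```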

_get_filename_list()[source]¶

Returns list with the Alignment.name of alignments.

Returns:

_ : list

List of Alignment.name of alignments.

_get_informative_sites(aln)[source]¶

Calculates number of informative sites in an Alignment.

Parameters:

aln : trifusion.process.sequence.Alignment

Alignment object.

Returns:

informative_sites : int

Number of informative sites for Alignment object.

_get_similarity = <functools.partial object>¶

_get_taxa_list(only_active=False)[source]¶

Gets the global taxa list from all alignments.

If called with no arguments, it returns the taxa list from all alignments, including the ‘inactive’ ones. The only_active argument can be set to True to return only the taxa list from the ‘active’ files.

Parameters:

only_active : bool

If True, only returns the taxa list from the ‘active’ alignments.

Returns:

full_taxa : list

List with taxon names as strings.
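The effect of only_active can be sketched with two ordered dictionaries standing in for the alignments and shelve_alignments attributes. In this illustration the values are plain taxa lists rather than Alignment objects, and the function name is hypothetical.

```python
from collections import OrderedDict


def get_taxa_list_sketch(alignments, shelved_alignments, only_active=False):
    """Collect unique taxon names, optionally from active alignments only."""
    sources = list(alignments.values())
    if not only_active:
        sources += list(shelved_alignments.values())
    full_taxa = []
    for taxa in sources:
        for taxon in taxa:
            if taxon not in full_taxa:
                full_taxa.append(taxon)
    return full_taxa


active = OrderedDict([("gene1.fas", ["spa", "spb"]),
                      ("gene2.fas", ["spb", "spc"])])
shelved = OrderedDict([("gene3.fas", ["spc", "spd"])])

all_taxa = get_taxa_list_sketch(active, shelved)
active_taxa = get_taxa_list_sketch(active, shelved, only_active=True)
```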

static _mad_based_outlier(p, threshold=3.5)[source]¶

Calculate MAD based outliers.

An outlier detection method based on median absolute deviation. This code was adapted from http://stackoverflow.com/a/22357811/1990165. The usage of the median is much less biased than the mean and is robust to smaller data sets.

Parameters:

p : numpy.array

Array with data observations.

threshold : float

Modified Z-score to use as a threshold.
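For one-dimensional data, a MAD-based outlier test can be sketched in pure Python as follows. This is a dependency-free illustration rather than the library's numpy-based code: it uses the commonly cited 0.6745 scaling constant for the modified Z-score, and the guard for a zero median absolute deviation is an added assumption.

```python
from statistics import median


def mad_based_outlier_sketch(points, threshold=3.5):
    """Return a list of booleans flagging outliers by modified Z-score."""
    med = median(points)
    abs_dev = [abs(x - med) for x in points]
    mad = median(abs_dev)
    if mad == 0:
        # Assumption: with no spread around the median, flag nothing
        return [False] * len(points)
    # Modified Z-score: 0.6745 * |x - median| / MAD
    return [0.6745 * dev / mad > threshold for dev in abs_dev]


data = [10, 11, 9, 10, 12, 10, 50]
flags = mad_based_outlier_sketch(data)
```

Here the median is 10 and the median absolute deviation is 1, so only the last observation exceeds the 3.5 threshold.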

static _reset_pipes(ns)[source]¶

Reset progress indicators for both GUI and CLI task executions.

This should be done at the end of any task that initialized the progress objects, but it only affects the Namespace object. It resets all Namespace attributes to None.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

_reset_summary_stats()[source]¶: Resets the summary_stats attribute.

_set_pipes(ns=None, pbar=None, total=None, msg=None)[source]¶

Setup of progress indicators for both GUI and CLI task executions.

This handles the setup of the objects responsible for updating the progress of a task’s execution in both TriFusion (GUI) and TriSeq (CLI). At the beginning of any given task, these objects can be initialized by providing either the Namespace object (ns) in the case of TriFusion, or the ProgressBar object (pbar), in the case of TriSeq. Along with one of these objects, the expected total of the progress should also be provided. The ns and pbar objects are updated at each iteration of a given task, and the total is used to get a measure of the progress.

Optionally, a message can be also provided for the Namespace object that will be used by TriFusion.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar

A ProgressBar object used to log the progress of TriSeq execution.

total : int

Expected total of the task’s progress.

msg : str, optional

A secondary message that appears in TriFusion’s progress dialogs.

Notes

The progress of any given task can be provided by either an Alignment or AlignmentList instance. Generally, the tasks follow the progress of the AlignmentList instance, unless that instance contains only one Alignment object. In that case, the progress information is piped from the Alignment instance. For that reason, and for the Namespace (ns) object only, this method checks if there is only one ‘active’ Alignment object. If so, it sets the Namespace.sa flag to True, so that the progress indicators are piped from the Alignment object instead.

Examples

Start a progress counter for a task that will make 100 iterations:

self._set_pipes(ns=ns_obj, pbar=pbar_obj, total=100,
                msg="Some message")

_table_exists(table_name, cur=None)[source]¶

Checks if a table exists in the database.

Convenience method that checks if a table exists in the database. It accepts a custom Cursor object, which overrides the cur attribute.

Parameters:

table_name : str

Name of the table.

cur : sqlite3.Cursor, optional

Custom Cursor object used to query the database.

Returns:

res : list

List with the results of a query for ‘table’ type with table_name name. Is empty when the table does not exist.

Notes

This returns a list that will contain one item if the table exists in the database. If it doesn’t exist, returns an empty list. Therefore, this can be used simply like:

if self._table_exists("my_table"):
     # Do stuff
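A query of this kind can be sketched with the standard sqlite3 module. The sketch below is an assumption about how such a check might look, not the method's actual code: the standalone function table_exists and its argument order are hypothetical, while the sqlite_master table is SQLite's own schema table.

```python
import sqlite3


def table_exists(cur, table_name):
    """Return a list with one row if the table exists, else an empty list."""
    return cur.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name=?",
        (table_name,)).fetchall()


con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE my_table (txId INTEGER, taxon TEXT, seq TEXT)")

if table_exists(cur, "my_table"):
    print("my_table is present")
if not table_exists(cur, "other_table"):
    print("other_table is absent")
```

Because an empty list is falsy, the result can be used directly in a conditional.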

static _update_pipes(ns=None, pbar=None, value=None, msg=None)[source]¶

Update progress indicators for both GUI and CLI task executions.

This method provides a single interface for updating the progress objects ns or pbar, which should have been initialized at the beginning of the task with the _set_pipes method.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar

A ProgressBar object used to log the progress of TriSeq execution.

value : int

Value of the current progress index

msg : str, optional

A secondary message that appears in TriFusion’s progress dialogs.

Notes

The progress of any given task can be provided by either an Alignment or AlignmentList instance. Generally, the tasks follow the progress of the AlignmentList instance, unless that instance contains only one Alignment object. In that case, the progress information is piped from the Alignment instance. For that reason, and for the Namespace (ns) object only, if the Namespace.sa attribute is set to True, the progress is not updated here, because it is being piped directly from the Alignment object.

Examples

Update the counter in an iteration of 100:

for i in range(100):
    self._update_pipes(ns=ns_obj, pbar=pbar_obj, value=i,
                       msg="Some string")
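The set/update/reset life cycle described for _set_pipes, _update_pipes and _reset_pipes can be sketched as follows. This is a simplified illustration only: a types.SimpleNamespace stands in for the Manager Namespace, the ProgressBar is left out, and the attribute names total, counter and msg are assumptions (only the sa flag is named in the documentation above).

```python
from types import SimpleNamespace


def set_pipes(ns, total, msg=None, single_alignment=False):
    """Initialize the progress attributes at the start of a task."""
    ns.total = total
    ns.counter = 0
    ns.msg = msg
    # When True, progress is piped from the single Alignment object
    ns.sa = single_alignment


def update_pipes(ns, value, msg=None):
    """Update the progress, unless it is piped from an Alignment object."""
    if ns.sa:
        return
    ns.counter = value
    if msg is not None:
        ns.msg = msg


def reset_pipes(ns):
    """Reset every progress attribute to None at the end of a task."""
    ns.total = ns.counter = ns.msg = ns.sa = None


ns_obj = SimpleNamespace()
set_pipes(ns_obj, total=100, msg="Some message")
for i in range(100):
    update_pipes(ns_obj, value=i)
print(ns_obj.counter, ns_obj.total)
reset_pipes(ns_obj)
```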

add_alignment_files(file_name_list, pbar=None, ns=None)[source]¶

Adds a list of alignment files to the current AlignmentList.

Adds a list of alignment paths to the current AlignmentList. Each path in the list is used to instantiate an Alignment object and several checks are performed to ensure that the alignments are correct and compliant with the other members of the AlignmentList object.

Parameters:

file_name_list : list

List with paths to sequence alignment files.

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar

A ProgressBar object used to log the progress of TriSeq execution.

add_alignments(alignment_obj_list, ignore_paths=False)[source]¶

Add a list of Alignment objects to the current AlignmentList.

Adds a list of alignments, already as Alignment objects, to the current AlignmentList. This method performs several checks to ensure that the Alignment objects being added are correct. If all checks pass, it updates all attributes of the class with the new alignments.

The ignore_paths argument can be used to prevent Alignment objects with short paths (only basename, for instance) from being added if they have the same basename as another in the alignments dict.

Parameters:

alignment_obj_list : list

List with Alignment objects.

ignore_paths : bool

If True, the Alignment.path of the new alignments is not checked against the loaded alignments.

alignments = None¶: Stores the “active” Alignment objects for the current AlignmentList. Keys will be the Alignment.path for quick lookup of Alignment object values

allele_frequency_spectrum(*args, **kwargs)[source]¶

Creates data for the allele frequency spectrum.

Calculates the allele frequency spectrum of the entire alignment data set. Here, multiple alignments are effectively treated as a single one. This method is exclusive to the DNA sequence type and supports IUPAC ambiguity codes.

Parameters:

proportions : bool

If True, use proportions instead of absolute values

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“ax_names”: 2 element list with axis labels [x, y],

“title”: str with title,

“table_header”: list with headers of table

allele_frequency_spectrum_gene(*args, **kwargs)[source]¶

Creates data for the allele frequency spectrum of an Alignment.

Parameters:

gene_name : str

Alignment.name of an alignment.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“ax_names”: 2 element list with axis labels [x, y],

“title”: str with title,

“table_header”: list with headers of table,

“real_bin_num”:

aln_names()[source]¶: Returns list of basenames of alignments file paths.

average_seqsize(*args, **kwargs)[source]¶

Creates data for the average sequence size for the entire data set.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“ax_names”: 2 element list with axis labels [x, y],

“title”: str with title,

“table_header”: list with headers of table

average_seqsize_per_species(*args, **kwargs)[source]¶

Creates data for average sequence size per taxon.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“labels”: list, label for xticks,

“ax_names”: 2 element list with axis labels [x, y],

“title”: str with title

bad_alignments = None¶: Attribute list that stores the Alignment.name attribute of badly formatted alignments

change_taxon_name(old_name, new_name)[source]¶: Changes the name of a taxon.

characters_proportion(*args, **kwargs)[source]¶

Creates data for the proportion of nt/aa for the data set.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“labels”: list, label for xticks,

“ax_names”: 2 element list with axis labels [x, y],

“title”: str with title,

“table_header”: list with headers of table,

characters_proportion_per_species(*args, **kwargs)[source]¶

Creates data for the proportion of nt/aa per species.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“labels”: list, label for xticks,

“ax_names”: 2 element list with axis labels [x, y],

“legend”: mpl Legend object,

“title”: str with title,

“table_header”: list with headers of table

clear_alignments()[source]¶: Clears all attributes and data from the AlignmentList object.

code_gaps(*args, **kwargs)[source]¶

Code gaps in each Alignment object.

This wraps the execution of code_gaps method for each Alignment object in the alignments attribute.

Parameters:

table_in : string

Name of database table containing the alignment data that is used for this operation.

table_out : string

Name of database table where the final alignment will be inserted (default is “gaps”).

use_main_table : bool

If True, both table_in and table_out are ignored and the main table Alignment.table_name is used as the input and output table (default is False).

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar

A ProgressBar object used to log the progress of TriSeq execution.

See also

Alignment.code_gaps

collapse(*args, **kwargs)[source]¶

Collapses equal sequences for each Alignment object.

This wraps the execution of collapse method for each Alignment object in the alignments attribute.

Parameters:

write_haplotypes : bool

If True, a file mapping the taxa names to their respective haplotype name will be generated (default is True).

haplotypes_file : string

Name of the file mapping taxa names to their respective haplotype. Only used when write_haplotypes is True. If it is not provided and write_haplotypes is True, the file name will be determined from Alignment.name.

dest : string

Path of directory where haplotypes_file will be generated (default is “.”).

conversion_suffix : string

Suffix appended to the haplotypes file. Only used when write_haplotypes is True and haplotypes_file is None.

haplotype_name : string

Prefix of the haplotype string. The final haplotype name will be haplotype_name + <int> (default is “Hap”).

table_in : string

Name of database table containing the alignment data that is used for this operation.

table_out : string

Name of database table where the final alignment will be inserted (default is “collapsed”).

use_main_table : bool

If True, both table_in and table_out are ignored and the main table Alignment.table_name is used as the input and output table (default is False).

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar

A ProgressBar object used to log the progress of TriSeq execution.

See also

Alignment.collapse
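The collapsing of equal sequences into haplotypes can be pictured with the short sketch below. It works on a plain dict that maps taxon names to sequences rather than on database tables, and the numbering of haplotypes from 1 is an assumption; the function name collapse_sequences is hypothetical.

```python
def collapse_sequences(alignment, haplotype_name="Hap"):
    """Collapse identical sequences and map haplotypes to their taxa.

    alignment: dict mapping taxon name -> sequence string.
    Returns (collapsed, mapping): collapsed maps haplotype name -> sequence,
    mapping maps haplotype name -> list of taxa sharing that sequence.
    """
    groups = {}
    for taxon, seq in alignment.items():
        # Taxa with the same sequence end up in the same group
        groups.setdefault(seq, []).append(taxon)

    collapsed, mapping = {}, {}
    for idx, (seq, taxa) in enumerate(groups.items(), start=1):
        hap = "{}{}".format(haplotype_name, idx)
        collapsed[hap] = seq
        mapping[hap] = taxa
    return collapsed, mapping


collapsed, mapping = collapse_sequences(
    {"spa": "acgt", "spb": "acgt", "spc": "acga"})
print(collapsed)
print(mapping)
```

The mapping dictionary holds the information that would be written to the haplotypes file when write_haplotypes is True.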

con = None¶: Database Connection object

concatenate(table_in='', ns=None, pbar=None)[source]¶

Concatenates alignments into a single Alignment object.

Concatenates multiple sequence alignments creating a single Alignment object and the auxiliary Partitions object defining the partitions of the concatenated alignment

Parameters:

table_in : string

Name of database table containing the alignment data that is used for this operation.

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar

A ProgressBar object used to log the progress of TriSeq execution.

Returns:

concatenated_alignment : trifusion.process.sequence.Alignment

The Alignment object with the concatenated data and partitions.

Notes

Since sqlite queries are quite expensive, the previous approach of querying the sequence for each taxon AND alignment table was dropped. In the current approach, sqlite does all the heavy lifting. First, a temporary table with the same definition as all alignment tables is populated with data from all taxa and alignments. It is important that an index is created on the new txId column, which redefines the txId of individual alignments according to the global taxa_names attribute. In this first procedure, there is only one query per alignment. When the temporary table is complete, some sql operations are used to group the sequences of each taxon across all alignments and then concatenate them, all in a single query. This query returns the concatenated sequences, corrected txId and taxon information, ready to populate the final concatenation table. This approach is an order of magnitude faster than the previous one, where thousands of queries could be performed.
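The general idea of this note can be sketched with the standard sqlite3 module. The example is a simplified variant and not the method's actual code: the table layout (taxon, seq), the names tmp_concat and concatenate_tables, and the padding of absent taxa with a missing symbol are all assumptions. It also joins the grouped rows in Python from a single ordered query, instead of concatenating inside the SQL statement.

```python
import sqlite3
from itertools import groupby


def concatenate_tables(con, tables, taxa_names, missing="n"):
    """Concatenate per-alignment tables into (txId, taxon, seq) rows."""
    cur = con.cursor()
    cur.execute("CREATE TABLE tmp_concat "
                "(txId INTEGER, taxon TEXT, aln INTEGER, seq TEXT)")
    # Index on the global txId so that the final ordered query is cheap
    cur.execute("CREATE INDEX tmp_concat_idx ON tmp_concat (txId, aln)")

    for aln, table in enumerate(tables):
        # One read query per alignment table (table names are trusted here)
        seqs = dict(cur.execute(
            'SELECT taxon, seq FROM "{}"'.format(table)).fetchall())
        length = len(next(iter(seqs.values())))
        # txId follows the global taxa order; absent taxa are padded
        cur.executemany(
            "INSERT INTO tmp_concat VALUES (?, ?, ?, ?)",
            [(tx_id, taxon, aln, seqs.get(taxon, missing * length))
             for tx_id, taxon in enumerate(taxa_names)])

    # A single query returns every sequence grouped by taxon, in order
    rows = cur.execute(
        "SELECT txId, taxon, seq FROM tmp_concat ORDER BY txId, aln")
    result = [(tx_id, taxon, "".join(r[2] for r in group))
              for (tx_id, taxon), group in groupby(rows, key=lambda r: r[:2])]
    cur.execute("DROP TABLE tmp_concat")
    return result


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE aln1 (taxon TEXT, seq TEXT)")
con.executemany("INSERT INTO aln1 VALUES (?, ?)",
                [("spA", "ac"), ("spB", "gt")])
con.execute("CREATE TABLE aln2 (taxon TEXT, seq TEXT)")
con.execute("INSERT INTO aln2 VALUES ('spA', 'ttt')")
print(concatenate_tables(con, ["aln1", "aln2"], ["spA", "spB"]))
```

The number of queries grows with the number of alignments only, not with the number of taxa multiplied by the number of alignments.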

consensus(*args, **kwargs)[source]¶

Creates consensus sequences for each alignment.

This wraps the execution of consensus method for each Alignment object in the alignments attribute. It has an additional argument, single_file, which can be used to merge the consensus sequence from each alignment into a new Alignment object.

Parameters:

consensus_type : {“IUPAC”, “Soft mask”, “Remove”, “First sequence”}

Type of variation handling. See summary above.

single_file : bool

If True, the consensus sequence of each Alignment object will be merged into a single Alignment object.

table_in : string

Name of database table containing the alignment data that is used for this operation.

table_out : string

Name of database table where the final alignment will be inserted (default is “consensus”).

use_main_table : bool

If True, both table_in and table_out are ignored and the main table Alignment.table_name is used as the input and output table (default is False).

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar

A ProgressBar object used to log the progress of TriSeq execution.

Returns:

consensus_aln : trifusion.process.sequence.Alignment

Alignment object with all the consensus sequences merged. It’s only returned when the single_file parameter is set to True.

See also

Alignment.consensus

cumulative_missing_genes(*args, **kwargs)[source]¶

Creates data for a distribution of the maximum number of genes available for consecutive thresholds of missing data.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“labels”: list, label for xticks,

“ax_names”: 2 element list with axis labels [x, y],

“title”: str with title,

“table_header”: list with headers of table

cur = None¶: Database Cursor object

duplicate_alignments = None¶: Attribute list that stores duplicate Alignment.name.

filter_by_taxa(taxa_list, filter_mode, ns=None, pbar=None)[source]¶

Filters alignments if they contain or exclude certain taxa.

Filters the alignments attribute by a given taxa list. The filtering may be performed to filter alignments depending on whether they include or exclude certain taxa.

Parameters:

taxa_list : list

List of taxa names.

filter_mode : str

Filter mode. Can be “contain” or “exclude”.

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar

A ProgressBar object used to log the progress of TriSeq execution.

filter_codon_positions(*args, **kwargs)[source]¶

Filters codon positions in each Alignment object.

This wraps the execution of filter_codon_positions method for each Alignment object in the alignments attribute.

Parameters:

position_list : list

List of three bool elements that correspond to each codon position. Ex. [True, True, True] will save all positions while [True, True, False] will exclude the third codon position

table_in : string

Name of database table containing the alignment data that is used for this operation.

table_out : string

Name of database table where the final alignment will be inserted (default is “filter”).

use_main_table : bool

If True, both table_in and table_out are ignored and the main table Alignment.table_name is used as the input and output table (default is False).

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar

A ProgressBar object used to log the progress of TriSeq execution.

See also

Alignment.filter_codon_positions
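The effect of position_list on a single sequence can be illustrated with a short sketch. The function name filter_codons is hypothetical, and the sketch assumes that the sequence starts at the first codon position.

```python
def filter_codons(sequence, position_list):
    """Keep only the characters whose codon position is set to True."""
    # i % 3 gives the codon position (0, 1 or 2) of each character
    return "".join(char for i, char in enumerate(sequence)
                   if position_list[i % 3])


# Excluding the third codon position of two codons: atg cca
print(filter_codons("atgcca", [True, True, False]))
```

With [True, True, True] the sequence is returned unchanged.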

filter_informative_sites(*args, **kwargs)[source]¶

Filters Alignment objects according to informative sites number.

Filters alignments according to whether or not their number of informative sites is within a specified range. Alignment objects with a number of informative sites outside the range are moved to the shelve_alignments attribute and the counter of filtered_alignments is updated.

Parameters:

min_val : int

Minimum number of informative sites for the alignment to pass. Can be None, in which case there is no lower bound.

max_val : int

Maximum number of informative sites for the alignment to pass. Can be None, in which case there is no upper bound.

table_in : string

Name of database table containing the alignment data that is used for this operation.

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar

A ProgressBar object used to log the progress of TriSeq execution.

See also

Alignment.filter_informative_sites
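The range test with optional bounds can be sketched as below. Two assumptions are made: a site is counted as informative when at least two character states each occur in at least two sequences (the usual parsimony definition), with gaps and missing data ignored, and the min_val and max_val bounds are treated as inclusive. Both function names are hypothetical.

```python
from collections import Counter


def count_informative_sites(sequences, ignore=("-", "n")):
    """Count columns where at least two states occur at least twice."""
    count = 0
    for column in zip(*sequences):
        freqs = Counter(char for char in column if char not in ignore)
        if sum(1 for n in freqs.values() if n >= 2) >= 2:
            count += 1
    return count


def within_range(value, min_val=None, max_val=None):
    """True when value lies in [min_val, max_val]; None means unbounded."""
    if min_val is not None and value < min_val:
        return False
    if max_val is not None and value > max_val:
        return False
    return True


seqs = ["aagt", "aagt", "atct", "atca"]
n_inf = count_informative_sites(seqs)
print(n_inf, within_range(n_inf, min_val=1), within_range(n_inf, max_val=1))
```

The same range test applies to the segregating sites filter, with a different site count.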

filter_min_taxa(min_taxa, ns=None, pbar=None)[source]¶

Filters alignments by minimum taxa proportion.

Filters Alignment objects based on a minimum taxa representation threshold. Alignments with less than the specified minimum taxa percentage will be moved to the filtered_alignments attribute.

NOTE: Since this filtering is meant to be performed when executing the process operations, it will permanently change the AlignmentList object, which means both self.alignments and self.partitions. Not doing so and removing/adding the partitions would create a great deal of conflicts that can be easily avoided by simply copying the AlignmentList object and modifying this object for the process execution

Parameters:

min_taxa : int

Percentage of minimum allowed taxa representation (e.g. 50).

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar

A ProgressBar object used to log the progress of TriSeq execution.
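The percentage test behind this filter can be sketched as follows. This is only an illustration on plain data structures: the function name split_by_min_taxa and the dict of taxa lists are hypothetical, and an alignment that matches the threshold exactly is assumed to pass.

```python
def split_by_min_taxa(alignments, all_taxa, min_taxa):
    """Split alignments into (kept, filtered) by taxa representation.

    alignments: dict mapping alignment name -> list of taxa it contains.
    all_taxa: list with every taxon of the data set.
    min_taxa: minimum percentage of taxa representation (e.g. 50).
    """
    kept, filtered = {}, {}
    for name, taxa in alignments.items():
        percentage = len(taxa) / len(all_taxa) * 100
        if percentage >= min_taxa:
            kept[name] = taxa
        else:
            filtered[name] = taxa
    return kept, filtered


kept, filtered = split_by_min_taxa(
    {"gene1": ["a", "b", "c", "d"], "gene2": ["a"], "gene3": ["a", "b"]},
    all_taxa=["a", "b", "c", "d"],
    min_taxa=50)
print(sorted(kept), sorted(filtered))
```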

filter_missing_data(*args, **kwargs)[source]¶

Filters missing data in each Alignment object.

This wraps the execution of filter_missing_data method for each Alignment object in the alignments attribute.

Parameters:

gap_threshold : int

Integer between 0 and 100 defining the percentage above which a column with that gap percentage is removed.

missing_threshold : int

Integer between 0 and 100 defining the percentage above which a column with that gap+missing percentage is removed.

table_in : string

Name of database table containing the alignment data that is used for this operation.

table_out : string

Name of database table where the final alignment will be inserted (default is “filter”).

use_main_table : bool

If True, both table_in and table_out are ignored and the main table Alignment.table_name is used as the input and output table (default is False).

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar

A ProgressBar object used to log the progress of TriSeq execution.

See also

Alignment.filter_missing_data
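How the two thresholds act on the alignment columns can be sketched as below. The sketch works on a list of equal-length strings instead of database tables; the function name filter_columns and the gap and missing symbols used as defaults are assumptions.

```python
def filter_columns(sequences, gap_threshold, missing_threshold,
                   gap="-", missing="n"):
    """Remove columns whose gap or gap+missing percentage is too high."""
    n_seqs = len(sequences)
    keep = []
    for idx, column in enumerate(zip(*sequences)):
        n_gap = column.count(gap)
        gap_pct = n_gap / n_seqs * 100
        total_pct = (n_gap + column.count(missing)) / n_seqs * 100
        # A column is removed when it is ABOVE either threshold
        if gap_pct <= gap_threshold and total_pct <= missing_threshold:
            keep.append(idx)
    return ["".join(seq[i] for i in keep) for seq in sequences]


aln = ["a-cn", "a-cg", "a-nt", "atng"]
print(filter_columns(aln, gap_threshold=50, missing_threshold=25))
```

In this example the second column is removed for having 75% gaps and the third for having 50% missing data.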

filter_segregating_sites(*args, **kwargs)[source]¶

Filters Alignment objects according to segregating sites number.

Filters alignments according to whether or not their number of segregating sites is within a specified range. Alignment objects with a number of segregating sites outside the range are moved to the shelve_alignments attribute and the counter of filtered_alignments is updated.

Parameters:

min_val : int

Minimum number of segregating sites for the alignment to pass. Can be None, in which case there is no lower bound.

max_val : int

Maximum number of segregating sites for the alignment to pass. Can be None, in which case there is no upper bound.

table_in : string

Name of database table containing the alignment data that is used for this operation.

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar

A ProgressBar object used to log the progress of TriSeq execution.

See also

Alignment.filter_segregating_sites

filtered_alignments = None¶: Records the number of filtered alignments by each individual filter type

format_list()[source]¶

Returns list of unique sequence types from Alignment objects.

Returns:

_ : list

List with unique sequence types for Alignment objects as strings.

gap_symbol = None¶: Dictionary with summary statistics for the active alignments

gene_occupancy(*args, **kwargs)[source]¶

Create data for gene occupancy plot.

Creates data for an interpolation plot to visualize the amount of missing genes in an alignment dataset.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“ax_names”: 2 element list with axis labels [x, y],

“title”: str with title

get_gene_table_stats(active_alignments=None, sortby=None, ascending=True)[source]¶

Returns summary statistics for each Alignment.

Gets summary statistics for each individual gene for the gene table view of TriFusion. Summary statistics have already been calculated in get_summary_stats, so here we only get the active alignments for showing. Therefore, this function should only be called after get_summary_stats.

Parameters:

active_alignments : list

List of Alignment.name objects to show

sortby : str

Table header used for sorting. Can be {“nsites”, “taxa”, “var”, “inf”, “gap”, “missing”}.

ascending : bool

If True, the sortby column will be sorted in ascending order (default is True).

Returns:

summary_gene_table : pandas.DataFrame

DataFrame with summary statistics for each alignment.

table : list

List with the data for creating .csv tables.

get_summary_stats(active_alignments=None, ns=None)[source]¶

Calculates summary statistics for the ‘active’ alignments.

Creates/Updates summary statistics for the active alignments.

Parameters:

active_alignments : list

List of Alignment objects for which the summary statistics will be calculated

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

summary_stats : dict

Dictionary with the overall summary statistics

table : list

List with overall summary statistics for creating .csv tables.

get_tables()[source]¶

Return list with table_name of all Alignment objects.

Returns:

_ : list

List with Alignment.table_name.

iter_alignment_files()[source]¶

Returns an iterable on Alignment.path of each alignment.

Returns:

_ : iter

Iterable over the Alignment.path attribute of each alignment.

length_polymorphism_correlation(*args, **kwargs)[source]¶

Creates data for correlation between alignment length and informative sites.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“ax_names”: 2 element list with axis labels [x, y],

“correlation”: bool, whether to perform spearman correlation,

“title”: str with title,

“table_header”: list with headers of table

missing_data_distribution(*args, **kwargs)[source]¶

Create data for overall distribution of missing data plot.

Creates data for overall distribution of missing data. This will calculate the amount of gaps, missing and actual data.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“ax_names”: 2 element list with axis labels [x, y],

“title”: str with title,

“legend”: mpl Legend object,

“table_header”: list with headers of table.

missing_data_per_species(*args, **kwargs)[source]¶

Creates data for missing data proportion per species plot.

For each taxon in taxon_list calculate the proportion of gap, missing and actual data.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“ax_names”: 2 element list with axis labels [x, y],

“title”: str with title,

“legend”: mpl Legend object,

“table_header”: list with headers of table,

“normalize”: bool, whether data values should be normalized,

“normalize_factor”: int, factor to use in normalization

missing_genes_average(*args, **kwargs)[source]¶

Creates histogram data with average missing taxa.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“ax_names”: 2 element list with axis labels [x, y],

“title”: str with title,

“table_header”: list with headers of table

missing_genes_per_species(*args, **kwargs)[source]¶

Creates data with distribution of missing genes per species plot.

Calculates, for each taxon, the number of alignments where it is present.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“labels”: list, label for xticks,

“ax_names”: 2 element list with axis labels [x, y],

“title”: str with title,

“table_header”: list with headers of table

non_alignments = None¶: Attribute list that stores sequence sets of unequal length

outlier_missing_data(*args, **kwargs)[source]¶

Create data for missing data alignment outliers.

Get data for outlier detection of genes based on the distribution of average missing data per gene. Data points will be based on the proportion of missing data symbols out of the possible total. For example, in an alignment with three taxa, each with 100 sites, the total possible missing data is 300 (100 * 3). Here, missing data will be gathered from all taxa and a proportion will be calculated based on the total possible.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“ax_names”: 2 element list with axis labels [x, y],

“title”: str with title,

“outliers”: list of outlier data points,

“outliers_labels”: list of outlier labels
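The proportion used as a data point for each gene can be worked through with a small sketch that mirrors the three-taxa example in the description. The function name missing_proportion is hypothetical, and treating only "n" as a missing data symbol is an assumption.

```python
def missing_proportion(sequences, missing_symbols=("n",)):
    """Proportion of missing data symbols out of the possible total."""
    total = sum(len(seq) for seq in sequences)
    missing = sum(seq.count(symbol)
                  for seq in sequences for symbol in missing_symbols)
    return missing / total


# Three taxa with 100 sites each: 300 possible missing data symbols
alignment = ["a" * 90 + "n" * 10, "a" * 100, "a" * 80 + "n" * 20]
print(missing_proportion(alignment))
```

Here 30 of the 300 possible positions are missing, so the gene contributes a data point of about 0.1.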

outlier_missing_data_sp(*args, **kwargs)[source]¶

Create data for missing data taxa outliers.

Gets data for outlier detection of species based on missing data. For this analysis, genes from which a taxon is completely absent are ignored in the calculations for that taxon. The reason is that including genes where the taxon is absent would bias the outlier detection towards taxa that have low prevalence in the data set, even if they have low missing data in the alignments where they are present.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“ax_names”: 2 element list with axis labels [x, y],

“title”: str with title,

“outliers”: list of outlier data points,

“outliers_labels”: list of outlier labels

outlier_segregating(*args, **kwargs)[source]¶

Creates data for segregating site alignment outliers.

Generates data for the outlier detection of genes based on segregating sites. The data will be based on the number of variable alignment columns, excluding gaps and missing data.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“ax_names”: 2 element list with axis labels [x, y],

“title”: str with title,

“outliers”: list of outlier data points,

“outliers_labels”: list of outlier labels

outlier_segregating_sp(*args, **kwargs)[source]¶

Create data for segregating sites taxa outliers.

Generates data for the outlier detection of species based on their average pair-wise proportion of segregating sites. Comparisons with gaps or missing data are ignored

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“ax_names”: 2 element list with axis labels [x, y],

“title”: str with title,

“outliers”: list of outlier data points,

“outliers_labels”: list of outlier labels

outlier_sequence_size(*args, **kwargs)[source]¶

Creates data for sequence size alignment outliers.

Generates data for the outlier detection of genes based on their sequence length (excluding missing data).

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“ax_names”: 2 element list with axis labels [x, y],

“title”: str with title,

“outliers”: list of outlier data points,

“outliers_labels”: list of outlier labels

outlier_sequence_size_sp(*args, **kwargs)[source]¶

Create data for sequence size taxa outliers.

Generates data for the outlier detection of species based on their sequence length (excluding missing data)

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“ax_names”: 2 element list with axis labels [x, y],

“title”: str with title,

“outliers”: list of outlier data points,

“outliers_labels”: list of outlier labels

path_list = None¶: List of Alignment.path

remove_file(filename_list)[source]¶

Removes alignments.

Removes Alignment objects from alignments and shelve_alignments based on their name attribute.

Parameters:

filename_list : list

List of Alignment.name to be removed.

remove_tables(preserve_tables=None, trash_tables=None)[source]¶

Drops tables from the database.

This method can be used to drop tables from the database.

If no arguments are provided it will remove all database tables associated with the AlignmentList object. If the preserve_tables argument is provided, the table names present in this list will not be dropped. If trash_tables is provided, only the tables names specified in that list are dropped. If both are provided, trash_tables will take precedence.

Parameters:

preserve_tables : list, optional

If provided, the table names in this list will NOT be dropped.

trash_tables : list, optional

If provided, only the table names in this list will be dropped. Takes precedence over preserve_tables if both are provided.
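The precedence between the two arguments can be summarized with a small sketch of the selection logic. It only decides which table names would be dropped and does not touch a database; the function name tables_to_drop and the all_tables argument are hypothetical.

```python
def tables_to_drop(all_tables, preserve_tables=None, trash_tables=None):
    """Return the table names that would be dropped."""
    if trash_tables:
        # trash_tables takes precedence: only these tables are dropped
        return [t for t in all_tables if t in trash_tables]
    if preserve_tables:
        # Drop everything except the preserved tables
        return [t for t in all_tables if t not in preserve_tables]
    # No arguments: every table is dropped
    return list(all_tables)


tables = ["main", "collapsed", "consensus"]
print(tables_to_drop(tables))
print(tables_to_drop(tables, preserve_tables=["main"]))
print(tables_to_drop(tables, preserve_tables=["main"],
                     trash_tables=["main"]))
```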

remove_taxa(taxa_list, mode='remove')[source]¶

Removes the specified taxa.

Removes the specified taxa from the global AlignmentList attributes and from each Alignment object.

This method supports a list or path to CSV file as the taxa_list_file argument. In case of CSV path, it parses the CSV file into a list. The CSV file should be a simple text file containing taxon names in each line.

This method also supports two removal modes.

remove: removes the specified taxa
inverse: removes all but the specified taxa

Parameters:

taxa_list_file : list or string

List containing taxa names or string with path to a CSV file containing the taxa names.

mode : {“remove”, “inverse”}

The removal mode (default is remove).

See also

Alignment.remove_taxa
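The two removal modes, together with the list-or-file input, can be sketched on a plain dict that maps taxon names to sequences. The helper names read_taxa and remove_taxa_from are hypothetical, and the file is assumed to contain one taxon name per line.

```python
def read_taxa(taxa_list_file):
    """Return a list of taxa from a list or from a path to a text file."""
    if isinstance(taxa_list_file, str):
        with open(taxa_list_file) as handle:
            # One taxon name per line; blank lines are skipped
            return [line.strip() for line in handle if line.strip()]
    return list(taxa_list_file)


def remove_taxa_from(alignment, taxa_list_file, mode="remove"):
    """Return a copy of alignment without the taxa selected for removal."""
    taxa = set(read_taxa(taxa_list_file))
    if mode == "remove":
        # Removes the specified taxa
        return {t: s for t, s in alignment.items() if t not in taxa}
    if mode == "inverse":
        # Removes all but the specified taxa
        return {t: s for t, s in alignment.items() if t in taxa}
    raise ValueError("mode must be 'remove' or 'inverse'")


aln = {"spA": "acgt", "spB": "acga", "spC": "acgg"}
print(remove_taxa_from(aln, ["spA"]))
print(remove_taxa_from(aln, ["spA"], mode="inverse"))
```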

resume_database()[source]¶

Reconnects to the sqlite database.

The connection to the database can be closed or even severed when passing objects between threads. This method allows the reconnection of the database for the AlignmentList and all Alignment objects, including the shelved ones.

retrieve_alignment(name)[source]¶

Return Alignment object with a given name.

Retrieves the Alignment object with a matching name attribute from alignments or shelve_alignments. Returns None if no match is found.

Parameters:

name : str

name attribute of an Alignment object.

Returns:

_ : trifusion.process.sequence.Alignment

Alignment object.

reverse_concatenate(ns=None)[source]¶

Reverse a concatenated file according to the partitions.

This is basically a wrapper of the reverse_concatenate method of Alignment and is rarely useful. It is preferable to retrieve an Alignment object, provide some partitions and then perform the reverse concatenation. Since the reverse concatenation operation can only be applied to Alignment objects, this method actually creates a concatenated Alignment .

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

reverted_alns : AlignmentList

AlignmentList with the partitions as Alignment objects.
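As a rough illustration of what reversing a concatenation by partitions means, the sketch below splits a concatenated alignment back into one alignment per partition. The data structures are assumptions made for the example (a dict of taxon to sequence, and partitions as half-open (start, end) ranges); TriFusion's own Partitions object is not used here.

```python
def split_by_partitions(alignment, partitions):
    """Split a concatenated alignment into one alignment per partition.

    alignment: dict mapping taxon name -> concatenated sequence string.
    partitions: dict mapping partition name -> (start, end), 0-based and
    end-exclusive. Both layouts are hypothetical, chosen for illustration.
    """
    return {
        name: {taxon: seq[start:end] for taxon, seq in alignment.items()}
        for name, (start, end) in partitions.items()
    }


concatenated = {"spa": "aaaacccc", "spb": "aaatcccg"}
parts = {"gene1": (0, 4), "gene2": (4, 8)}
reverted = split_by_partitions(concatenated, parts)
```

Each entry of reverted plays the role of one Alignment in the returned AlignmentList.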

select_by_taxa(taxa_list, mode='strict')[source]¶

Selects a list of Alignment objects by taxa.

This method is used to select Alignments based on whether they contain or exclude certain taxa. The selected alignments are returned in a list.

Parameters:

taxa_list : list

List of taxa names.

mode : str

Selection mode. Can be:

“strict”: Selects alignments containing all and only the provided taxa.
“inclusive”: Selects alignments containing the provided taxa.
“relaxed”: Selects alignments if they contain at least one of the provided taxa.

Returns:

selected_alignments : list

List of Alignment objects.
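The three selection modes map naturally onto set operations. The following sketch shows that reading of the descriptions above; matches and select are hypothetical helpers, and alignments are represented only by their taxa lists.

```python
def matches(aln_taxa, taxa_list, mode="strict"):
    """Return True if an alignment's taxa satisfy the selection mode."""
    aln_taxa, query = set(aln_taxa), set(taxa_list)
    if mode == "strict":
        # All and only the provided taxa
        return aln_taxa == query
    if mode == "inclusive":
        # All provided taxa, possibly with others
        return query <= aln_taxa
    if mode == "relaxed":
        # At least one of the provided taxa
        return bool(query & aln_taxa)
    raise ValueError("mode must be 'strict', 'inclusive' or 'relaxed'")


def select(alignments, taxa_list, mode="strict"):
    """Return the names of the alignments that match, in insertion order."""
    return [name for name, taxa in alignments.items()
            if matches(taxa, taxa_list, mode)]


alignments = {
    "geneA": ["spa", "spb"],
    "geneB": ["spa", "spb", "spc"],
    "geneC": ["spc", "spd"],
}
strict_hits = select(alignments, ["spa", "spb"], "strict")
inclusive_hits = select(alignments, ["spa", "spb"], "inclusive")
relaxed_hits = select(alignments, ["spb", "spc"], "relaxed")
```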

sequence_code = None¶: Tuple with the AlignmentList sequence code. Either (“DNA”, “n”) or (“Protein”, “x”)

sequence_segregation(*args, **kwargs)[source]¶

Creates data for overall distribution of segregating sites.

Parameters:

proportions : bool

If True, use proportions instead of absolute values

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“ax_names”: 2 element list with axis labels [x, y],

“title”: str with title,

“table_header”: list with headers of table,

“real_bin_num”:

sequence_segregation_gene(*args, **kwargs)[source]¶

Create data for a sliding window analysis of segregating sites.

Retrieves an alignment using gene_name and calculates the number of segregating sites along the alignment length using a sliding window approach.

Parameters:

gene_name : str

Alignment.name from an alignment

window_size : int

Size of sliding window.

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“window_size”: int, size of sliding window,

“ax_names”: 2 element list with axis labels [x, y],

“title”: str with title,

“table_header”: list with headers of table
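A sliding window count of segregating sites can be sketched as follows. This is an illustration under stated assumptions, not the method's actual implementation: windows are non-overlapping, and gap and missing data symbols are ignored when deciding whether a column is variable.

```python
def segregating_sites_windows(sequences, window_size, gap="-", missing="n"):
    """Count segregating (variable) columns in consecutive windows.

    Assumes equal-length sequences and non-overlapping windows; gap and
    missing symbols do not count as character states.
    """
    length = len(sequences[0])
    counts = []
    for start in range(0, length, window_size):
        window = [seq[start:start + window_size] for seq in sequences]
        segregating = 0
        for column in zip(*window):
            states = {char.lower() for char in column} - {gap, missing}
            if len(states) > 1:
                segregating += 1
        counts.append(segregating)
    return counts


seqs = ["aaccggtt", "aaccgatt", "atccggta"]
per_window = segregating_sites_windows(seqs, window_size=4)
```

With these three sequences, the first window has one variable column and the second has two.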

sequence_segregation_per_species(*args, **kwargs)[source]¶

Creates data for a triangular matrix of sequence segregation for pairs of taxa.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“labels”: list, label for xticks,

“color_label”: str, label for colorbar

sequence_similarity(*args, **kwargs)[source]¶

Creates data for average sequence similarity plot.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“ax_names”: 2 element list with axis labels [x, y]

sequence_similarity_gene(*args, **kwargs)[source]¶

Creates data for sliding window sequence similarity for alignment.

Retrieves an alignment using gene_name and calculates the average pair-wise sequence similarity along the alignment length using a sliding window approach.

Parameters:

gene_name : str

Alignment.name from an alignment

window_size : int

Size of sliding window.

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“window_size”: int, size of sliding window,

“ax_names”: 2 element list with axis labels [x, y],

“title”: str with title,

“table_header”: list with headers of table
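The average pair-wise similarity per window can be sketched in a few lines. The definition of similarity used here (proportion of identical positions, skipping positions where either sequence has a gap) and the non-overlapping windows are assumptions for the example; the method itself may define them differently.

```python
from itertools import combinations


def pairwise_similarity(seq1, seq2, gap="-"):
    """Proportion of identical positions, skipping positions with a gap."""
    compared = identical = 0
    for a, b in zip(seq1, seq2):
        if a == gap or b == gap:
            continue
        compared += 1
        identical += a == b
    return identical / compared if compared else None


def window_similarity(sequences, window_size):
    """Average pair-wise similarity in consecutive, non-overlapping windows."""
    averages = []
    for start in range(0, len(sequences[0]), window_size):
        chunk = [seq[start:start + window_size] for seq in sequences]
        values = [pairwise_similarity(a, b) for a, b in combinations(chunk, 2)]
        values = [v for v in values if v is not None]
        averages.append(sum(values) / len(values) if values else None)
    return averages


seqs = ["aaaacccc", "aaaacccg", "aaaaccgg"]
similarity_per_window = window_similarity(seqs, window_size=4)
```

The first window is identical across all sequences, so its average is 1.0; the second window averages the three pair-wise values 0.75, 0.5 and 0.75.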

sequence_similarity_per_species(*args, **kwargs)[source]¶

Creates data for sequence similarity per species pair plot.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“labels”: list, label for xticks,

“color_label”: str, label for colorbar

set_database_connections(cur, con)[source]¶

Provides Connection and Cursor to Alignment objects.

Sets the database connections manually for all (even the shelved ones) Alignment objects.

Parameters:

cur : sqlite3.Cursor

Provide a database Cursor object.

con : sqlite3.Connection

Provide a database Connection object.

set_partition_from_alignment(alignment_obj)[source]¶

Updates partitions object with the provided Alignment object.

Uses the partitions of an Alignment object to set the partitions attribute of AlignmentList

Parameters:

alignment_obj : trifusion.process.sequence.Alignment

Alignment object.

shelve_alignments = None¶: Stores the “inactive” or “shelved” Alignment objects. All AlignmentList methods will operate only on the alignments attribute, unless explicitly stated otherwise. Key-value is the same as the alignments attribute.

shelved_taxa = None¶: List with non active taxa

sql_path = None¶: Path to sqlite3 database file

summary_gene_table = None¶: Lists the currently active tables. This is mainly used for the conversion of the consensus alignments into a single Alignment object

summary_stats = None¶: Dictionary with summary statistics for each alignment.

taxa_distribution(*args, **kwargs)[source]¶

Creates data for a distribution of taxa frequency across alignments.

Parameters:

ns : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

Returns:

_ : dict

“data”: numpy.array with data for plotting,

“labels”: list, label for xticks,

“title”: str with title,

“color_label”: str, label for colorbar,

“real_bin_num”:

taxa_names = None¶: List with the name of the taxa included in the AlignmentList object

update_active_alignment(aln_name, direction)[source]¶

Updates the ‘active’ status of a single alignment.

Changes the ‘active’ status of a single alignment in a specified direction.

After updating the alignments and shelve_alignments attributes, the taxa_names attribute is also updated.

Parameters:

aln_name : str

Alignment.name to update.

direction : str

Direction where aln_name will be moved. Can be {“shelve”, “active”}.

update_active_alignments(aln_list=None, all_files=False)[source]¶

Sets the active alignments.

Sets the ‘active’ alignments stored in the alignments attribute and the ‘inactive’ alignments stored in the shelve_alignments attribute, based on the aln_list argument. Regardless of the current ‘active’ and ‘inactive’ alignments, calling this method with the aln_list argument will set only those as the ‘active’ alignments.

Alternatively, the all_files bool argument can be provided to set all alignments as ‘active’.

After updating the alignments and shelve_alignments attributes, the taxa_names attribute is also updated.

Parameters:

aln_list : list

List of Alignment.name that will be set as ‘active’.

all_files : bool

If True, all alignments become active. Ignores aln_list.
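The bookkeeping described above can be sketched with two plain dictionaries standing in for the alignments and shelve_alignments attributes. set_active is a hypothetical function, and each alignment is reduced to its list of taxa so that the taxa_names update can be shown as well.

```python
def set_active(alignments, shelved, aln_list=None, all_files=False):
    """Return (active, shelved, taxa_names) after re-partitioning.

    alignments and shelved map alignment name -> list of taxa. The
    current active/shelved split is ignored: only names in aln_list end
    up active, unless all_files is True.
    """
    pool = dict(alignments)
    pool.update(shelved)
    if all_files:
        active, inactive = pool, {}
    else:
        wanted = set(aln_list or [])
        active = {k: v for k, v in pool.items() if k in wanted}
        inactive = {k: v for k, v in pool.items() if k not in wanted}
    # Taxa names are recomputed from the active alignments only
    taxa_names = sorted({tx for taxa in active.values() for tx in taxa})
    return active, inactive, taxa_names


current = {"a.fas": ["spa", "spb"]}
on_shelf = {"b.fas": ["spc"]}
active, inactive, taxa = set_active(current, on_shelf, aln_list=["b.fas"])
```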

update_taxa_names(taxa_list=None, all_taxa=False)[source]¶

Sets the active taxa.

Sets the ‘active’ taxa stored in the taxa_names attribute and the ‘inactive’ taxa stored in the shelved_taxa attribute, based on the taxa_list argument.

Alternatively, the all_taxa bool argument can be provided to set all taxa as ‘active’.

Parameters:

taxa_list : list

List of taxon names that will be set as ‘active’.

all_taxa : bool

If True, all taxa become active. Ignores taxa_list.

write_taxa_to_file(file_name='Taxa_list.csv')[source]¶

Writes the taxa in taxa_list to a file.

Parameters:

file_name : string

Path to the generated file (default is ‘Taxa_list.csv’).

write_to_file(output_format, conversion_suffix='', output_suffix='', *args, **kwargs)[source]¶

Writes Alignment objects into files.

Wrapper of the write_to_file method of the Alignment object.

Parameters:

output_format : list

List with the output formats to generate. Options are: {“fasta”, “phylip”, “nexus”, “stockholm”, “gphocs”, “ima2”}.

output_file : str

Name of the output file. If using ns_pipe in TriFusion, it will prompt the user if the output file already exists.

tx_space_nex : int

Space (in characters) provided for the taxon name in nexus format (default is 40).

tx_space_phy : int

Space (in characters) provided for the taxon name in phylip format (default is 40).

tx_space_ima2 : int

Space (in characters) provided for the taxon name in ima2 format (default is 10).

cut_space_nex : int

Set the maximum allowed character length for taxon names in nexus format. Longer names are truncated (default is 39).

cut_space_phy : int

Set the maximum allowed character length for taxon names in phylip format. Longer names are truncated (default is 39).

cut_space_ima2 : int

Set the maximum allowed character length for taxon names in ima2 format. Longer names are truncated (default is 8).

interleave : bool

Determines whether the output alignment will be written in leave, i.e. sequential (False), or interleave (True) format. Not all output formats support this option.

gap : str

Symbol for alignment gaps (default is ‘-’).

model_phylip : str

Substitution model for the auxiliary partition file of phylip format, compliant with RAxML.

outgroup_list : list

Specify list of taxon names that will be defined as the outgroup in the nexus output format. This may be useful for analyses with MrBayes or other software that may require outgroups.

ima2_params : list

A list with the additional information required for the ima2 output format. The list should contain the following information, in this order:

(str) path to file containing the species and populations.

(str) Population tree in newick format, e.g. (0,1):2.

(str) Mutational model for all alignments.

(str) inheritance scalar.

use_charset : bool

If True, partitions from the Partitions object will be written in the nexus output format (default is True).

partition_file : bool

If True, the auxiliary partitions file will be written (default is True).

output_dir : str

If provided, the output file will be written on the specified directory.

phy_truncate_names : bool

If True, taxon names in phylip format are truncated to a maximum of 10 characters (default is False).

ld_hat : bool

If True, the fasta output format will include a first line compliant with the format of LD Hat and will truncate sequence names and sequence length per line accordingly.

use_nexus_models : bool

If True, writes the partitions charset block in nexus format.

ns_pipe : multiprocessing.Manager.Namespace

A Namespace object used to communicate with the main thread in TriFusion.

pbar : ProgressBar

A ProgressBar object used to log the progress of TriSeq execution.

table_suffix : string

Suffix of the table from where the sequence data is fetched. The suffix is appended to Alignment.table_name.

table_name : string

Name of the table from where the sequence data is fetched.
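The interplay between the tx_space_* and cut_space_* parameters can be pictured as truncating a taxon name and then padding it to a fixed width. The sketch below is only one plausible reading of those parameter descriptions; format_taxon is a hypothetical helper and the real writer may format names differently.

```python
def format_taxon(name, tx_space, cut_space):
    """Truncate name to cut_space characters, then pad it to tx_space.

    Hypothetical helper: assumes the sequence starts right after a
    fixed-width name column.
    """
    return name[:cut_space].ljust(tx_space)


# Defaults documented above: 40/39 for phylip and nexus, 10/8 for ima2
phylip_field = format_taxon("Homo_sapiens", tx_space=40, cut_space=39)
ima2_field = format_taxon("Homo_sapiens", tx_space=10, cut_space=8)
```

Under this reading, keeping cut_space smaller than tx_space guarantees at least one space between the name and the sequence.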

exception trifusion.process.sequence.AlignmentUnequalLength[source]¶

Bases: exceptions.Exception

Raised when sequences in alignment have unequal length.

class trifusion.process.sequence.LookupDatabase(func)[source]¶

Bases: object

Decorator handling hash lookup table with pre-calculated values.

This decorator class is used to decorate class methods that calculate pairwise sequence similarity. To ensure proper functionality, the decorated method should be called first with a single “connect” argument and, after finishing all calculations, with a single “disconnect” argument. These special calls are necessary to set up and close the database storing the calculated values, respectively.

Parameters:

func : function

Decorated function

Notes

The con and c attributes are initialized when the decorated function is called with the single “connect” argument (e.g. decorated_func(“connect”)). When the decorated function is called with the single “disconnect” argument (e.g. decorated_func(“disconnect”)), the database changes are committed, the sqlite Connection is closed and the Cursor is reset.

Attributes

func	(function) Decorated function
con	(None or sqlite.Connection) Connection object of the sqlite database
c	(None or sqlite.Cursor) Cursor object of the sqlite database

Methods

__call__(*args) Wraps the call of func.

__call__(*args)[source]¶

Wraps the call of func.

When the decorated method is called, this code is wrapped around its execution. It accepts an arbitrary number of positional arguments but it is currently designed to decorate the _get_similarity method of the AlignmentList class. There are three expected calling modes:

1. decorated_func(“connect”): Initializes the Connection and Cursor objects of the database.
2. decorated_func(seq1, seq2, locus_length): The typical main execution of the method, providing the sequence1 and sequence2 strings along with an integer with their length.
3. decorated_func(“disconnect”): Commits changes to the database and closes the Connection and Cursor objects.

The path of this sqlite database is automatically obtained from the AlignmentList attribute sql_path. We use the path to the same directory, but with a different name for the database, “pw.db”.

Parameters:

args : list

List of positional arguments for decorated method

Notes

When the decorated method is called normally (i.e., not with a “connect” or “disconnect” argument), it first creates a hash of the argument list (excluding self). This hash is used to query the database to check whether a value for that combination of sequence strings and sequence length has already been calculated. If so, the result is returned immediately without executing the decorated method. If not, the method is executed, the result is stored in the database, and the result is returned.
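The three calling modes and the hash lookup can be sketched with a small decorator class. This is a simplified illustration, not the LookupDatabase source: it decorates a plain function rather than a method, uses an in-memory database instead of “pw.db”, and the table name pw, the md5 hash and the LookupSketch name are all assumptions.

```python
import hashlib
import sqlite3


class LookupSketch:
    """Cache the results of func in an sqlite table keyed by argument hash."""

    def __init__(self, func):
        self.func = func
        self.con = None
        self.c = None

    def __call__(self, *args):
        if args == ("connect",):
            # Mode 1: set up the Connection and Cursor (in-memory here)
            self.con = sqlite3.connect(":memory:")
            self.c = self.con.cursor()
            self.c.execute(
                "CREATE TABLE IF NOT EXISTS pw (hash TEXT PRIMARY KEY, value REAL)")
            return None
        if args == ("disconnect",):
            # Mode 3: commit, close the Connection and reset the Cursor
            self.con.commit()
            self.con.close()
            self.con = self.c = None
            return None
        # Mode 2: hash the arguments and look for a stored value first
        key = hashlib.md5(repr(args).encode("utf-8")).hexdigest()
        row = self.c.execute(
            "SELECT value FROM pw WHERE hash = ?", (key,)).fetchone()
        if row is not None:
            return row[0]
        value = self.func(*args)
        self.c.execute("INSERT INTO pw VALUES (?, ?)", (key, value))
        return value


calls = []


@LookupSketch
def similarity(seq1, seq2, locus_length):
    calls.append((seq1, seq2))  # records how often the real work runs
    return sum(a == b for a, b in zip(seq1, seq2)) / locus_length


similarity("connect")
first = similarity("acgt", "acga", 4)
second = similarity("acgt", "acga", 4)  # answered from the table
similarity("disconnect")
```

After these calls, calls contains a single entry, since the second request never reaches the decorated function.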

trifusion.process.sequence.check_data(func)[source]¶

Decorator handling the result from AlignmentList plotting methods.

This should decorate all plotting methods from the AlignmentList class. It can be used to control warnings and handle exceptions when calling the plotting methods. Currently, it ensures that a list of methods are not executed when the AlignmentList instance contains a single alignment, and handles cases where the plotting methods return an empty array of data. Further control can be added here to prevent methods from being executed in certain conditions and handling certain outputs of the plotting methods.

The only requirement is that, even when the plotting methods are not executed, this should always return a dictionary. In such cases, this dictionary should contain a single key:value with {“exception”:<exception_type_string>}. The <exception_type_string> should be a simple string with an informative name that will be handled in the stats_write_plot of the TriFusionApp class.

Parameters:

func : function

Decorated function

Returns:

res : dict

The value returned by func is always a dictionary. If the decorator prevents the execution of func for some reason, ensure that a dictionary is still returned, with a single key:value {“exception”: <exception_type_string>}
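A minimal sketch of such a decorator is shown below. The function names, the NEEDS_SEVERAL set and the exception strings “single_alignment” and “empty_data” are hypothetical choices for the example, and plain lists stand in for numpy arrays.

```python
from functools import wraps

# Hypothetical names of plotting functions that need several alignments
NEEDS_SEVERAL = {"demo_plot"}


def check_data_sketch(func):
    """Always return a dict, even when func is skipped or yields no data."""
    @wraps(func)
    def wrapper(alignments, *args, **kwargs):
        if len(alignments) == 1 and func.__name__ in NEEDS_SEVERAL:
            # func is not executed at all in this case
            return {"exception": "single_alignment"}
        res = func(alignments, *args, **kwargs)
        if len(res.get("data", [])) == 0:
            return {"exception": "empty_data"}
        return res
    return wrapper


@check_data_sketch
def demo_plot(alignments):
    """Count in how many alignments each taxon occurs."""
    counts = {}
    for taxa in alignments.values():
        for taxon in taxa:
            counts[taxon] = counts.get(taxon, 0) + 1
    return {"data": sorted(counts.values()), "title": "Taxa frequency"}


single = demo_plot({"a": ["spa"]})
several = demo_plot({"a": ["spa"], "b": ["spa", "spb"]})
empty = demo_plot({"a": [], "b": []})
```

The caller can then branch on the presence of the “exception” key without having to catch anything.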

trifusion.process.sequence.setup_database(func)[source]¶

Decorator handling the active database tables.

Decorates methods from the Alignment object that use and perform modifications to the original alignment data. All methods decorated with this must have the keyword arguments table_in and table_out (and use_main_table, optionally). The values associated with these arguments will determine which tables will be used and created before the execution of the decorated method (See Notes for the rationale). The strings provided as arguments for table_in and table_out will serve as a suffix to the Alignment instance attribute table_name.

If use_main_table is provided, and is set to True, both table_in and table_out will default to Alignment.table_name. Values provided for table_in and table_out at calling time will be ignored.

The following cases assume use_main_table is not provided or set to False.

If both table_in and table_out are provided, table_out is modified so that table_out = Alignment.table_name + table_out. If table_out is not provided, it defaults to Alignment.table_name.

If the final table_out does not exist in the database, it is created. In this case, table_in will default to Alignment.table_name. If table_out already exists in the database, then table_in=table_out.

Parameters:

func : function

Decorated function

Notes

In order to fully understand the mechanism behind the setup of the database tables, one must first know that when an Alignment instance is created, it generates a table in the database containing the original data from the alignment. In TriFusion (GUI), this original table MUST NOT be modified, since users may want to execute several methods on the same Alignment object. Therefore, when a particular method needs to modify the original alignment, a new temporary table is created to store the modified version until the end of the execution. If a second modification is requested, it will also be necessary to set the input table as the output table of the previous alignment modification. In the case of TriSeq (CLI), there is no such requirement, so it’s much simpler to use the same table for all modifications.

This decorator greatly simplifies this process in the same way for all methods of the Alignment object that modify the original alignment data. To accomplish this, all decorated methods must have the keyword arguments table_in and table_out (and use_main_table, optionally).

For methods called within the execution of TriFusion, the idea is simple. Since we don’t know which methods will be used by the user, all chained methods that will create the same output file can be called with table_in=new_table and table_out=new_table. Note that both arguments have the same value. Whichever method is called first will find that “new_table” does not yet exist. In this case, the decorator will create “new_table” in the database and reset table_in to the original Alignment.table_name. This ensures that the first method will still be able to fetch the alignment data. In the following methods, table_in and table_out will be used based on the original values, ensuring that the alignment data is being fetched from the last modification made and that the modification chain is maintained. In this way, the execution of Alignment methods can be the same, regardless of the order or number of operations requested by the user.

In the case of TriSeq, the methods should set use_main_table=True. This will always set table_in=table_out=Alignment.table_name, which means that the main database will be used for all methods.
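The table resolution rules described above can be condensed into a short sketch. resolve_tables is a hypothetical function: a set of table names stands in for the sqlite database, and adding a name to the set stands in for creating the table.

```python
def resolve_tables(main_table, existing, table_in=None, table_out=None,
                   use_main_table=False):
    """Return the final (table_in, table_out) pair.

    main_table plays the role of Alignment.table_name and existing is a
    set of table names already in the database. As in the description,
    the final table_in depends only on whether table_out already exists,
    so the table_in argument is accepted but not consulted.
    """
    if use_main_table:
        # TriSeq case: always work on the main table
        return main_table, main_table
    out = main_table + table_out if table_out else main_table
    if out not in existing:
        existing.add(out)  # stands in for creating the table
        return main_table, out
    return out, out


tables = {"aln1"}
first_call = resolve_tables("aln1", tables, "_new", "_new")
second_call = resolve_tables("aln1", tables, "_new", "_new")
cli_call = resolve_tables("aln1", tables, "_new", "_new", use_main_table=True)
```

The first call reads from the original table and writes to the new one; every later call in the chain reads from and writes to the new table.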

trifusion.process.sequence.setup_intable(func)[source]¶

Decorator handling active database table when fetching data.

This decorator is meant to decorate methods of the Alignment class that retrieve alignment data from the database. The requirement is that these methods have the positional arguments table_suffix, table_name and table. These are all optional, and only table_suffix and table_name should be used when calling these methods. The values of these two will be used to define the value of table, so any value provided to this argument when calling the decorated method will be ignored. In the end, the table variable will contain the final table name and only this variable will be used to retrieve the alignment data.

table_suffix is a suffix that is appended to Alignment.table_name. table_name defines a complete table name.

If neither table_suffix nor table_name is provided, then table will default to Alignment.table_name.

If table_name is provided, table=table_name. If table_suffix is provided, table=Alignment.table_name + table_suffix.

If both are provided, table_name takes precedence over table_suffix and table=table_name.

In any case, we test the existence of the final table value in the database. If it does not exist, table will revert to Alignment.table_name to prevent errors.
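The precedence rules above can be sketched as a small function. resolve_intable is a hypothetical name, and a set of table names stands in for the existence check against the database.

```python
def resolve_intable(main_table, existing, table_suffix=None, table_name=None):
    """Return the table to read from.

    main_table plays the role of Alignment.table_name and existing is a
    set of table names present in the database.
    """
    if table_name:
        # table_name takes precedence over table_suffix
        table = table_name
    elif table_suffix:
        table = main_table + table_suffix
    else:
        table = main_table
    # Fall back to the main table if the requested one does not exist
    return table if table in existing else main_table


tables = {"aln1", "aln1_collapse", "consensus"}
by_suffix = resolve_intable("aln1", tables, table_suffix="_collapse")
by_name = resolve_intable("aln1", tables, "_collapse", "consensus")
fallback = resolve_intable("aln1", tables, table_suffix="_missing")
```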

Parameters:

func : function

Decorated function

Attributes

func	(function) Decorated function