Source code for trifusion.process.sequence

"""The `sequence` module of TriFusion contains the main classes handling
alignment sequence data. These are :class:`.Alignment` and
:class:`.AlignmentList`. Here follows a brief explanation of how these
classes work and how to deal with the sqlite database.

`Alignment` class
-----------------

The :class:`.Alignment` class is the main interface with single alignment
files. It can be viewed as the building block of an :class:`.AlignmentList`
object, which can have one or more :class:`.Alignment` objects. It contains
all methods and attributes that pertain to a given alignment and are used to
retrieve information or modify it. **However, it is NOT meant to be used
independently**, but rather within the context of an :class:`.AlignmentList`
object. The data from each alignment is stored in a single sqlite database
during the execution of TriFusion or the TriSeq/TriStats CLI programs. The
connection to this database is automatically handled by
:class:`.AlignmentList` for all :class:`.Alignment` objects included in it.
In this way, we can use the :class:`.AlignmentList` class to handle the
setup of the sqlite3 database, and focus on single alignment data handling
in general in this class.

The main types of methods defined in this class are:

Parsers
~~~~~~~

Parsing methods are defined for each format with `_read_<format>`:

    - :meth:`~.Alignment._read_phylip`: Parses phylip format.
    - :meth:`~.Alignment._read_fasta`: Parses fasta format.
    - :meth:`~.Alignment._read_nexus`: Parses nexus format.
    - :meth:`~.Alignment._read_loci`: Parses pyRAD/ipyrad loci format.
    - :meth:`~.Alignment._read_stockholm`: Parses stockholm format.

They are always called from the :meth:`~.Alignment.read_alignment` method,
and not directly. When an :class:`.Alignment` object is instantiated with a
path to an alignment file, it automatically detects the format of the
alignment and keeps that information on the :attr:`~.Alignment.input_format`
attribute. The :meth:`~.Alignment.read_alignment` method then calls the
parsing method corresponding to that format. That information is stored in a
dictionary::

    parsing_methods = {
        "phylip": self._read_phylip,
        "fasta": self._read_fasta,
        "loci": self._read_loci,
        "nexus": self._read_nexus,
        "stockholm": self._read_stockholm
    }

    # Call the appropriate method
    parsing_methods[self.input_format]()

Each format has its own parsing method (which can be modified directly). To
add a new format, it is necessary to add it to the automatic format
recognition in :meth:`~trifusion.process.base.Base.autofinder`. Then,
create the new parsing method, using the same `_read_<format>` notation and
add it to the `parsing_methods` dictionary in
:meth:`~.Alignment.read_alignment`.

New parsers must insert alignment data into a table in the sqlite database.
This table is automatically created when the :class:`.Alignment` object
is instantiated, and its name is stored in the :attr:`~.Alignment.table_name`
attribute (see :meth:`~.Alignment._create_table`)::

    cur.execute("CREATE TABLE [{}]("
                "txId INT,"
                "taxon TEXT,"
                "seq TEXT)".format(table_name))

To insert data into the database, a taxon id (`txId`), taxon name (`taxon`)
and sequence (`seq`) must be provided. For example::

    cur.execute("INSERT INTO [{}] VALUES (?, ?, ?)".format(self.table_name),
                (0, "spa", "AAA"))

Data fetching
~~~~~~~~~~~~~

To facilitate fetching alignment data from the database, several generators
and data retrieval methods are defined:

    - :meth:`~.Alignment.iter_columns`: Iterates over the columns of the
      alignment.
    - :meth:`~.Alignment.iter_columns_uniq`: Iterates over the columns of
      the alignment but yields only unique characters.
    - :meth:`~.Alignment.iter_sequences`: Iterates over each sequence in the
      alignment.
    - :meth:`~.Alignment.iter_alignment`: Iterates over both taxon name
      and sequence in the alignment.
    - :meth:`~.Alignment.get_sequence`: Returns the sequence of a
      particular taxon.

These methods should always use the :func:`.setup_intable` decorator and be
defined with the `table_suffix`, `table_name` and `table` arguments. For
instance, the :meth:`.Alignment.iter_sequences` method is a
generator that allows the iteration over the sequences in the alignment
and is defined as::

    @setup_intable
    def iter_sequences(self, table_suffix="", table_name=None, table=None):

When calling these methods, only the `table_suffix` and `table_name` have to
be provided. In fact, the value provided to the `table` argument at calling
time is ignored. The decorator will check the values of both `table_suffix`
and `table_name` and evaluate the database table that will be used. This
final table name will then be provided as the `table` argument value within
the decorator. In this way, these methods can be called like::

    for seq in self.iter_sequences(table_suffix="_collapse"):
        # Do stuff

In this case, the :func:`.setup_intable` decorator will append the
`table_suffix` to the name of the original alignment table. If
:attr:`.Alignment.table_name` is `"main"`, then the final table name in
this case will be "main_collapse".

Alternatively, we can use `table_name`::

    for seq in self.iter_sequences(table_name="main_2"):
        # Do stuff

In this case, the final table name will be "main_2".

If the final table name does not exist, the method falls back to the original
table name defined by the `table_name` attribute.
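The resolution logic described above can be sketched with a simplified
stand-in decorator. This is not the actual `setup_intable` implementation,
and the `MockAlignment` class is invented for illustration; the sketch only
mirrors the suffix/name/fallback behavior::

```python
import functools


def setup_intable_sketch(func):
    """Simplified mimic of setup_intable's table name resolution."""
    @functools.wraps(func)
    def wrapper(self, table_suffix="", table_name=None, table=None):
        # An explicit table_name wins; otherwise append the suffix to
        # the alignment's main table name.
        candidate = table_name if table_name else self.table_name + table_suffix
        # Fall back to the original table if the candidate does not exist.
        final = candidate if candidate in self.tables else self.table_name
        return func(self, table_suffix, table_name, table=final)
    return wrapper


class MockAlignment(object):
    def __init__(self):
        self.table_name = "main"
        self.tables = {"main", "main_collapse"}

    @setup_intable_sketch
    def iter_sequences(self, table_suffix="", table_name=None, table=None):
        # The real method would query the database table; here we just
        # return the resolved table name for illustration.
        return table
```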

Alignment modifiers
~~~~~~~~~~~~~~~~~~~

Methods that perform modifications to the alignment are also defined here.
These include:

    - :meth:`~.Alignment.collapse`: Collapses alignment into unique sequences.
    - :meth:`~.Alignment.consensus`: Merges alignment into a single sequence.
    - :meth:`~.Alignment.filter_codon_positions`: Filters alignment columns
      according to codon position.
    - :meth:`~.Alignment.filter_missing_data`: Filters columns according to
      missing data.
    - :meth:`~.Alignment.code_gaps`: Codes the alignment's indel patterns as
      a binary matrix.

For example, the :meth:`.Alignment.collapse` method transforms the original
alignment into a new one that contains only unique sequences. An important
factor to take into account with alignment-modifying methods is that it may
be important to preserve the original alignment data for future operations.
In TriFusion, the original alignment must be available at all times since
users may perform any number of process executions in a single session.
Therefore, all methods that can potentially modify the original alignment
need to be decorated with the :func:`.setup_database` function, and must be
defined with at least the `table_in` and `table_out` arguments. The
decorator and these arguments will work together to determine the database's
table from where the data will be fetched, and to where the modified
alignment will be written. For instance, the :meth:`.Alignment.collapse`
method is defined as::

    @setup_database
    def collapse(..., table_in=None, table_out="collapsed",
                 use_main_table=False):

If we want to perform a collapse of the original alignment and store the
modified alignment in a new table, we could call :meth:`.Alignment.collapse`
like::

    collapse(table_out="new_table")

The :func:`.setup_database` decorator interprets `table_in=None` as an
instruction to use the table with the original alignment, and stores the
modified alignment in `"new_table"`.

However, we may want to perform a collapse operation after a previous
modification from another method. In that case, we can specify a
`table_in`::

    collapse(table_in="other_table", table_out="collapse")

One issue with this approach is that we do not know *a priori* which
operations will be requested by the user nor the order. If one execution
performs, say, a `consensus` and a `collapse`, the new table should be
created in `consensus` and then used as input in `collapse`. However,
if only `collapse` is called, then the new table should only be created
there. To solve this issue, the `setup_database` decorator is smart about
its arguments. We can create a sequence of operations with the same
`table_in` and `table_out` arguments::

    new_table = "master_table"
    if "consensus" in operations:
        consensus(table_in=new_table, table_out=new_table)
    if "collapse" in operations:
        collapse(table_in=new_table, table_out=new_table)

In this simple pipeline, the user may perform either a `consensus`,
a `collapse`, or both. When the first method is called,
the :func:`.setup_database` decorator will check if the table provided in
`table_in` exists. In the first called method it will not exist, so instead
of returning an error, it falls back to the original alignment table and
then writes the modified alignment to "master_table". In the second method,
`table_in` already exists, so it fetches alignment data from the
"master_table". This will work whether these methods are called individually
or in combination.

When there is no need to keep the original alignment data (in a single
execution of TriSeq, for instance), the special `use_main_table` argument
can be provided to tell the method to use the original table as the input
and output table. If this argument is True, it supersedes any information
provided by `table_in` or `table_out`::

    collapse(use_main_table=True)
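The routing behavior of `setup_database` described in this section can be
sketched with a simplified stand-in. This is not the real decorator, and
`DummyAlignment` is invented for illustration; it only mirrors the
`table_in` fallback and the `use_main_table` override::

```python
import functools


def setup_database_sketch(func):
    """Simplified mimic of setup_database's table routing."""
    @functools.wraps(func)
    def wrapper(self, table_in=None, table_out=None, use_main_table=False):
        if use_main_table:
            # Supersedes table_in/table_out: read from and write to the
            # original alignment table.
            return func(self, table_in=self.table_name,
                        table_out=self.table_name)
        # Fall back to the original table when table_in is missing or
        # does not exist yet.
        if table_in is None or table_in not in self.tables:
            table_in = self.table_name
        self.tables.add(table_out)
        return func(self, table_in=table_in, table_out=table_out)
    return wrapper


class DummyAlignment(object):
    def __init__(self):
        self.table_name = "main"
        self.tables = {"main"}

    @setup_database_sketch
    def consensus(self, table_in=None, table_out="consensus",
                  use_main_table=False):
        # The real method would build a consensus; we return the
        # resolved input/output tables for illustration.
        return table_in, table_out

    @setup_database_sketch
    def collapse(self, table_in=None, table_out="collapsed",
                 use_main_table=False):
        return table_in, table_out
```

Running the two-step pipeline from the text against this sketch shows the
first call reading from "main" and the second from "master_table".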

Writers
~~~~~~~

Like parsers, writer methods are defined with `_write_<format>`:

    - :meth:`~.Alignment._write_fasta`: Writes to fasta format.
    - :meth:`~.Alignment._write_phylip`: Writes to phylip format.
    - :meth:`~.Alignment._write_nexus`: Writes to nexus format.
    - :meth:`~.Alignment._write_stockholm`: Writes to stockholm format.
    - :meth:`~.Alignment._write_gphocs`: Writes to gphocs format.
    - :meth:`~.Alignment._write_ima2`: Writes to IMa2 format.
    - :meth:`~.Alignment._write_mcmctree`: Writes to MCMCTree format.

They are always called from the :meth:`.Alignment.write_to_file` method,
not directly. When the :meth:`.Alignment.write_to_file` method is called,
a list with the requested output formats is also provided as an argument.
For each format specified in the argument, the corresponding writer method
is called. That method is responsible for fetching the data from the
database and writing it to an output file in the appropriate format. The map
between the formats and the methods is stored in a dictionary::

    write_methods = {
        "fasta": self._write_fasta,
        "phylip": self._write_phylip,
        "nexus": self._write_nexus,
        "stockholm": self._write_stockholm,
        "gphocs": self._write_gphocs,
        "ima2": self._write_ima2,
        "mcmctree": self._write_mcmctree
    }

    # Call appropriate method for each output format
    for fmt in output_format:
        write_methods[fmt](output_file, **kwargs)

The `output_file` and a `kwargs` dictionary are provided as arguments to
each of these methods. The `kwargs` dictionary contains all keyword
arguments used when calling `write_to_file` and each writer method fetches
the ones relevant to the format. For instance, in the beginning of the
:meth:`~.Alignment._write_fasta` method, the relevant keyword arguments are
retrieved::

    # Get relevant keyword arguments
    ld_hat = kwargs.get("ld_hat", False)
    interleave = kwargs.get("interleave", False)
    table_suffix = kwargs.get("table_suffix", None)
    table_name = kwargs.get("table_name", None)
    ns_pipe = kwargs.get("ns_pipe", None)
    pbar = kwargs.get("pbar", None)

In addition to the `write_methods` dictionary, a dictionary mapping each
format to its corresponding file extension is also defined::

    format_ext = {"ima2": ".txt",
                  "mcmctree": "_mcmctree.phy",
                  "phylip": ".phy",
                  "nexus": ".nex",
                  "fasta": ".fas",
                  "stockholm": ".stockholm",
                  "gphocs": ".txt"}

To add a new output format writer, simply create the method using the
`_write_<format>` notation and include it in the `write_methods` dictionary,
along with the extension in the `format_ext` variable.

`AlignmentList` class
---------------------

The :class:`.AlignmentList` is the main interface between the user and the
alignment data. It may contain one or more :class:`.Alignment` objects,
which are considered the building blocks of the data set. These
:class:`.Alignment` objects are bound by the same sqlite database connection.

How to create an instance
~~~~~~~~~~~~~~~~~~~~~~~~~

An :class:`.AlignmentList` instance can be created with a single argument,
which is a list of paths to alignment files::

    aln_obj = AlignmentList(["file1.fas", "file2.fas"])

Even when no information is provided, an sqlite database connection is
established automatically (generating a "trifusion.sqlite3" file in the
current working directory by default). However, it is possible and advisable
to specify the path to the sqlite database::

    aln_obj = AlignmentList(["file1.fas", "file2.fas"], sql_db=".sql.db")

A connection to the database can also be provided when instantiating the
class, so it's perhaps more useful to see how the
:meth:`~.AlignmentList.__init__` works::

    def __init__(self, alignment_list, sql_db=None, db_cur=None, db_con=None,
                 pbar=None):

        if db_cur and db_con:
            self.con = db_con
            self.cur = db_cur
        elif sql_db:
            self.sql_path = sql_db
        else:
            self.sql_path = "trifusion.sqlite3"

As we can see, if a database connection is provided via the `db_cur` and
`db_con` arguments, the `sql_db` argument is ignored. If no sqlite
information is provided, the `sql_path` attribute defaults to
"trifusion.sqlite3".

Adding alignment data
~~~~~~~~~~~~~~~~~~~~~

Alignment data can be loaded when initializing the class as above and/or
added later via the :meth:`~.AlignmentList.add_alignments` and
:meth:`~.AlignmentList.add_alignment_files` methods (in fact,
the :meth:`~.AlignmentList.add_alignment_files` method is the one called at
:meth:`~.AlignmentList.__init__`)::

    aln_obj.add_alignment_files(["file3.fas", "file4.fas"])

The most important aspects when adding alignment data are how each alignment
is processed and how errors and exceptions are handled. Briefly, the flow is:

  1. Check for file duplications within the loaded file list. Duplications
     are stored in the `duplicate_alignments` attribute.
  2. Check for file duplications between the loaded files and any
     alignment already present. The duplications go to the same attribute
     as above.
  3. For each provided file, create an :class:`.Alignment` object. Errors
     that occur when creating an `Alignment` object are stored in its
     :attr:`~.Alignment.e`
     attribute. This attribute is then checked before adding the alignment to
     the :class:`.AlignmentList` object.
      1. Check for `InputError` exceptions. These are malformed files
         and are stored in the `bad_alignments` attribute.
      2. Check for `AlignmentUnequalLength` exceptions. These are sequence
         sets of unequal length (not alignments) and are stored in the
         `non_alignments` attribute.
      3. Check for `EmptyAlignment` exceptions. These are empty alignments
         and are stored in the `bad_alignments` attribute.
      4. Check the sequence code of the `Alignment` object. For now,
         the `AlignmentList` object only accepts alignments of the same
         sequence type (e.g. DNA, protein).
      5. If all of the previous checks pass, add the `Alignment` object
         to the `alignments` attribute.
  4. Update the overall taxa names of the full data set after the
     inclusion of the new alignments.
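The triage in steps 3.1 to 3.4 can be sketched as follows. The exception
classes here are local stand-ins for the real ones in
`trifusion.process.error_handling`, and the `triage` helper is invented for
illustration::

```python
class InputError(Exception):
    """Stand-in: malformed files."""

class AlignmentUnequalLength(Exception):
    """Stand-in: sequence sets of unequal length."""

class EmptyAlignment(Exception):
    """Stand-in: empty alignments."""


def triage(alignments, loaded, bad, non_aln, duplicates):
    """Route each parsed alignment according to its `e` attribute."""
    for aln in alignments:
        if aln.path in loaded:
            # Duplicate of an alignment already present
            duplicates.append(aln.path)
        elif isinstance(aln.e, (InputError, EmptyAlignment)):
            # Malformed or empty files
            bad.append(aln.path)
        elif isinstance(aln.e, AlignmentUnequalLength):
            # Sequence sets, not alignments
            non_aln.append(aln.path)
        else:
            # All checks passed; accept the alignment
            loaded[aln.path] = aln
```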

Wrapping `Alignment` methods
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The majority of the methods defined in the :class:`.Alignment` object can
also be accessed in the :class:`.AlignmentList` object. These are defined
roughly with the same arguments in both classes so that their behavior is
the same. These can be simple wrappers that call the respective
:class:`.Alignment` method for each alignment in the
:attr:`~.AlignmentList.alignments` attribute. For instance,
the :meth:`~.AlignmentList.change_taxon_name` method is simply::

    def change_taxon_name(self, old_name, new_name):

        for alignment_obj in list(self.alignments.values()):
            alignment_obj.change_taxon_name(old_name, new_name)

        self.taxa_names = [new_name if x == old_name else x
                           for x in self.taxa_names]

To avoid duplicating the argument list, the wrapping method can use `args`
and `kwargs` to transfer arguments. This ensures that if the argument list
is modified in the :class:`.Alignment` method, it doesn't need any
modification in the wrapper method. For instance,
the :meth:`~.Alignment.write_to_file` method of :class:`.Alignment` accepts
a large number of positional and keyword arguments, which would be a hassle
to define and maintain in the wrapper method of :class:`.AlignmentList`. So,
the :meth:`~.AlignmentList.write_to_file` method of :class:`.AlignmentList`
is simply defined as::

    def write_to_file(self, output_format, conversion_suffix="",
                      output_suffix="", *args, **kwargs):

The `output_format`, `conversion_suffix` and `output_suffix` are the only
positional arguments required when calling this wrapper. All the remaining
arguments are packed in the `args` and `kwargs` objects and used
normally by the wrapped method.
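This pass-through pattern can be reduced to a small runnable sketch (the
`AlignmentListSketch` and `FakeAlignment` classes are invented stand-ins,
not the real implementations)::

```python
class AlignmentListSketch(object):
    """Illustration of the *args/**kwargs pass-through pattern."""

    def __init__(self, alignments):
        self.alignments = alignments

    def write_to_file(self, output_format, conversion_suffix="",
                      output_suffix="", *args, **kwargs):
        # All remaining arguments travel untouched to each Alignment,
        # so changes to the wrapped signature need no wrapper edits.
        for alignment_obj in list(self.alignments.values()):
            alignment_obj.write_to_file(output_format, *args, **kwargs)
```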

Exclusive `AlignmentList` methods
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some methods are exclusive to :class:`.AlignmentList` because they only
make sense when applied to lists of alignments (e.g.
:meth:`~.AlignmentList.concatenate`). These have more freedom in how they
are defined and called.

Active and inactive datasets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Taxa and/or alignments can become 'inactive', that is, they are temporarily
removed from their respective attributes, `taxa_names` and `alignments`.
This means that these 'inactive' elements are ignored when performing most
operations. To change the 'active' status of alignments,
the :meth:`~.AlignmentList.update_active_alignments` and
:meth:`~.AlignmentList.update_active_alignment` methods are available. For
taxa, the :meth:`~.AlignmentList.update_taxa_names` method can be used::

    # Set only two active alignments
    self.update_active_alignments(["file1.fas", "file2.fas"])

    # Set only two active taxa
    self.update_taxa_names(["taxonA", "taxonB"])

Note that all these modifications are reversible. 'Inactive' elements are
stored in the :attr:`~.AlignmentList.shelve_alignments` attribute for
alignments, and :attr:`~.AlignmentList.shelved_taxa` for taxa.

Updating `Alignment` objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some tasks perform changes to core attributes of :class:`.AlignmentList`,
but those changes may also be necessary in each :class:`.Alignment` object. For
instance, the :meth:`~.AlignmentList.remove_taxa` method is used to remove a
list of taxa from the :class:`.AlignmentList` object. It is easy to change
only the relevant :class:`.AlignmentList` attributes, but this change also
requires those particular taxa to be removed in all :class:`.Alignment`
objects. For this reason, such methods should be defined in the same way in
both classes. Using the :meth:`~.AlignmentList.remove_taxa` example::

    def remove_taxa(self, taxa_list, mode="remove"):

        # <changes to AlignmentList>

        for alignment_obj in list(self.alignments.values()):
            alignment_obj.remove_taxa(taxa_list, mode=mode)

As you can see, the usage is the same for both methods.

Plot data methods
~~~~~~~~~~~~~~~~~

The :class:`.AlignmentList` object contains all methods that generate data
for plotting in TriFusion's Statistics screen and TriStats CLI program.
However, it's important to establish a separation between the generation of
plot data, and the generation of the plot itself. The
:class:`.AlignmentList` methods only generate the data and relevant
instructions necessary to draw the plot. This information is then passed
on to the appropriate plot generation functions, which are defined in
:mod:`trifusion.base.plotter`. The reason for this separation of tasks is
that many different alignment analyses are represented by the same plot.

The complete process of how new plots can be added to TriFusion is
described here_. In this section, we provide only a few guidelines on what
to expect from these methods.

All plot data methods must be decorated with the :func:`.check_data` decorator
and take at least a `Namespace` argument. In most cases, no more arguments
are required::

    @check_data
    def gene_occupancy(self, ns=None):

The :func:`.check_data` decorator is responsible for performing checks
before and after executing the method. The `Namespace` argument, `ns`,
is used to allow communication between the main and worker threads of
TriFusion.

Additional keyword arguments may be defined, but in that case they must be
provided in TriFusion when the
:meth:`trifusion.app.TriFusionApp.stats_show_plot` method is called,
using the `additional_args` argument. This object will be passed to the
:func:`~trifusion.data.resources.background_tasks.get_stats_data` function
in :mod:`trifusion.data.resources.background_tasks` and used when calling
the plot data methods::

    if additional_args:
        plot_data = methods[stats_idx](ns=ns, **additional_args)
    else:
        plot_data = methods[stats_idx](ns)

The other requirement of plot data methods is that they must always return
a single dictionary object. This dictionary must contain at least one
`key:value` pair with the "data" key and a numpy.array with the plot data as the
value. The other entries in the dictionary are optional and refer to
instructions for plot generation. For example, the
`missing_data_distribution` method returns::

    return {"data": data,
            "title": "Distribution of missing data",
            "legend": legend,
            "ax_names": ["Proportion", "Number of genes"],
            "table_header": ["Bin"] + legend}

The first entry is the mandatory "data" key with the numpy.array `data`.
The other instructions are "title", which sets the title of the plot,
"legend", which provides the labels for the plot legend, "ax_names", which
provides the names of the `x` and `y` axes, and "table_header", which
specifies the header for the table of that plot.

The allowed plot instructions depend on the plot function that will be used
and not all of them need to be specified.
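As a sketch of the expected shape of a plot data method (the
`gene_length_distribution` name and its input are hypothetical, and the
`check_data` decorator is omitted here)::

```python
import numpy as np


def gene_length_distribution(lengths, ns=None):
    """Return the mandatory "data" key plus optional plot instructions."""
    data = np.asarray(lengths)
    return {"data": data,
            "title": "Distribution of gene lengths",
            "ax_names": ["Length", "Number of genes"]}
```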

.. _here: https://github.com/ODiogoSilva/TriFusion/wiki/
          Add-Statistics-plot-analysis

Logging progress
~~~~~~~~~~~~~~~~

The majority of :class:`.AlignmentList` methods support the setup and update of
progress indicators that can be used in TriFusion (GUI) and the CLI programs.
In the case of TriFusion, the progress indicator is provided via a
`multiprocessing.Namespace` object that transfers information between the
main thread (where the GUI changes take place) and the working thread.
In the case of the CLI programs, the indicator is provided via a `ProgressBar`
object. In either case, the setup, update and resetting of the progress
indicators is performed by the same methods.

At the beginning of an operation, the :meth:`~.AlignmentList._set_pipes`
method is called::

    self._set_pipes(ns, pbar, total=len(self.alignments))

The `Namespace` object is defined as `ns` and the `ProgressBar` is defined
as `pbar`. Usually, only one of them is provided, depending on whether it
was called from TriFusion or from a CLI program. We also set the total of
the progress indicator. In this case it's the number of alignments. In case
the operation is called from TriFusion using the `Namespace` object,
this method also checks the number of active alignments. If there is only
one active alignment, it sets a Namespace attribute that will silence the
progress logging of the :class:`.AlignmentList` object and receive the
information from the :class:`.Alignment` object instead::

    if ns:
        if len(self.alignments) == 1:
            ns.sa = True
        else:
            ns.sa = False

Then, inside the task, we can update the progress within a loop using
the :meth:`~.AlignmentList._update_pipes` method::

    for p, alignment_obj in enumerate(list(self.alignments.values())):
        self._update_pipes(ns, pbar, value=p + 1,
                           msg="Some message")

Here, we provide the `Namespace` and `ProgressBar` objects as before.
In addition we provide the `value` associated with each iteration.
Optionally, the `msg` argument can be specified, which is used exclusively
by TriFusion to show a custom message.

At the end of a task it's good practice to reset the progress indicators
using the :meth:`~.AlignmentList._reset_pipes` method::

    self._reset_pipes(ns)

Here, only the `Namespace` object is necessary, since the `ProgressBar`
indicator is automatically reset on :meth:`~.AlignmentList._set_pipes`.

Incorporating a kill switch
~~~~~~~~~~~~~~~~~~~~~~~~~~~

All time consuming methods of :class:`.AlignmentList` accept a `Namespace`
object when called from TriFusion, allowing communication between the main
thread and the worker thread. Since python `threads` cannot be forcibly
terminated like `processes`, most methods should listen to a kill switch
flag that is actually an attribute of the `Namespace` object. This kill
switch is already incorporated into the :meth:`~.AlignmentList._set_pipes`,
:meth:`~.AlignmentList._update_pipes` and
:meth:`~.AlignmentList._reset_pipes` methods. ::

    if ns:
        if ns.stop:
            raise KillByUser("")

This code listens for a kill signal from TriFusion's window (triggered,
for example, by clicking the "Cancel" button). When this kill signal is
received in the main thread, it sets the `Namespace.stop` attribute to True
in both main and worker threads. In the worker thread, when the `stop`
attribute evaluates to True, a custom `KillByUser` exception is raised,
immediately stopping the task. The exception is then handled in the
:func:`trifusion.data.resources.background_tasks.process_execution` function
and transmitted to the main thread.
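The kill-switch pattern can be reduced to a small runnable sketch. The
`long_task` function is invented for illustration, and `KillByUser` here is
a local stand-in for the real exception in
`trifusion.process.error_handling`::

```python
class KillByUser(Exception):
    """Stand-in: raised in the worker when the user cancels."""


def _check_kill(ns):
    # Pattern from the text: raise as soon as the shared flag flips.
    if ns is not None and getattr(ns, "stop", False):
        raise KillByUser("")


def long_task(items, ns=None):
    """A time-consuming loop that honors the kill switch."""
    done = []
    for item in items:
        _check_kill(ns)
        done.append(item)
    return done
```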

Using SQLite
------------

A great deal of the high performance and memory efficiency of the `sequence`
module comes from the use of sqlite to store and manipulate alignment data
on disk rather than RAM. This means that parsing, modifying and writing
alignment data must be done with great care to ensure that only the
essential data is loaded into memory, while minimizing the number of
(expensive) database queries. This has some implications for the methods of
both :class:`.Alignment` and :class:`.AlignmentList` objects with respect to
how parsing, alignment modification and output writing is performed.

Implications for parsing
~~~~~~~~~~~~~~~~~~~~~~~~

When writing or modifying parsing methods it is important to take into
account that alignment files can be very large (> 1Gb) and loading the
entire data into memory should be avoided. Whenever possible, only the data
of a single taxon should be kept in memory before inserting it into the
database and then releasing that memory. For most formats, particularly
leave formats, it's fairly straightforward to do this. However, interleave
formats can fragment the data from each taxon across the entire file. Since
database insertions and updates are expensive, loading the data in each line
can greatly decrease the performance in these formats. Therefore,
it's preferable to read the alignment file once per taxon, join the entire
sequence of that taxon, and then insert it into the database. This ensures
that only sequence data from one taxon is kept in memory at any given time
and only a minimal number of database insertions are performed. It will also
result in the same file being parsed N times, where N is the number of taxa.
However, the decrease in speed is marginal, since most lines are actually
skipped, whereas the efficiency increase is substantial.
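A minimal sketch of this one-pass-per-taxon strategy, assuming a simplified
interleaved layout where a header gives the taxon count and every block
lists taxa in the same order (real phylip parsing has more edge cases; the
`read_interleaved` name is invented)::

```python
def read_interleaved(path):
    """Parse an interleaved file one taxon per pass, so only a single
    taxon's sequence is held in memory at any given time."""
    with open(path) as fh:
        ntaxa = int(fh.readline().split()[0])

    results = {}
    for idx in range(ntaxa):
        taxon = None
        chunks = []
        with open(path) as fh:
            fh.readline()  # skip header
            # Enumerate non-blank lines; every ntaxa-th line (offset by
            # idx) belongs to the current taxon, the rest are skipped.
            for lineno, line in enumerate(l.strip() for l in fh if l.strip()):
                if lineno % ntaxa != idx:
                    continue
                fields = line.split()
                if taxon is None:
                    # First block carries the taxon name
                    taxon = fields[0]
                    chunks.append("".join(fields[1:]))
                else:
                    chunks.append("".join(fields))
        # Here the joined sequence would be INSERTed into the database;
        # we just collect it for illustration.
        results[taxon] = "".join(chunks)
    return results
```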

Implications for fetching data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Retrieving data from an sqlite database is not as simple as accessing python
native data structures. Therefore, a set of methods and generators have
been defined in the :class:`.Alignment` object to facilitate the interface with
the data in the database (See `Data fetching`_). When some kind of data
is required from the database, it is preferable to modify or create a
dedicated method, instead of interacting with the database directly. This
creates a layer of abstraction between the database and :class:`.Alignment`/
:class:`.AlignmentList` methods that greatly facilitates future updates.

Implications for alignment modification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When performing modifications to the alignment data it is important to take
into account that the original alignment data may need to be preserved for
future operations. These methods must be defined and called using
appropriate decorators and arguments that establish the name of the
database table from where information will be retrieved, and the name of the
table where the information will be written (See `Alignment modifiers`_).

Implications for writing files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When writing alignment data into new output files, the same caution as in
alignment parsing is advised. It's quite easy to let the entire alignment
be loaded into RAM, particularly when writing in interleave format.

"""

import numpy as np
import pandas as pd
from collections import Counter, defaultdict, OrderedDict
import itertools
import re
import os
from os.path import join, basename, splitext, exists
from itertools import compress
from threading import Lock
import functools
import sqlite3

# TriFusion imports

try:
    import process
    from process.base import dna_chars, aminoacid_table, iupac, \
        iupac_rev, iupac_conv, Base
    from process.data import Partitions
    from process.data import PartitionException
    from process.error_handling import DuplicateTaxa, KillByUser, \
        InvalidSequenceType, EmptyData, InputError, EmptyAlignment, \
        MultipleSequenceTypes, SingleAlignment
except ImportError:
    import trifusion.process as process
    from trifusion.process.base import dna_chars, aminoacid_table, iupac, \
        iupac_rev, iupac_conv, Base
    from trifusion.process.data import Partitions
    from trifusion.process.data import PartitionException
    from trifusion.process.error_handling import DuplicateTaxa, KillByUser, \
        InvalidSequenceType, EmptyData, InputError, EmptyAlignment, \
        MultipleSequenceTypes, SingleAlignment

# import pickle
# TODO: Create a SequenceSet class for sets of sequences that do not conform
# to an alignment, i.e. unequal length. This would eliminate the problems of
# applying methods designed for alignments to sets of sequences with unequal
# length, and would allow these sets of sequences to have methods of their
# own. However, the __init__ of the current Alignment object should apply to
# both SequenceSet and Alignment classes. So, I'll have to re-structure the
# code somehow.
# TODO After creating the SequenceSet class, an additional class should be
# used to make the triage of files to either the Alignment or SequenceSet
# classes

# Lock mechanism to prevent concurrent access to sqlite database
lock = Lock()


[docs]class LookupDatabase(object): """Decorator handling hash lookup table with pre-calculated values. This decorator class is used to decorate class methods that calculate pairwise sequence similarity. To ensure proper functionality, the decorated method should be called first with a single "connect" argument and after finishing all calculations, with a single "disconnect" argument. These special method callings are necessary to setup and close the database storing the calculated values, respectively. Parameters ---------- func : function Decorated function Attributes ---------- func : function Decorated function con : None or sqlite.Connection Connection object of the sqlite database c : None or sqlite.Cursor Cursor object of the sqlite database Notes ----- The `con` and `c` attributes are initialized when the decorated function is called with the single "connect" argument (e.g. decorated_func("connect")). When the decorated function is called with the single "disconnect" argument (e.g. decorated_func("disconnect")), the database changes are committed, the sqlite Connection is closed and the Cursor is reset. """ def __init__(self, func): self.func = func self.con = None self.c = None
 def __call__(self, *args): """Wraps the call of `func`. When the decorated method is called, this code is wrapped around its execution. It accepts an arbitrary number of positional arguments but it is currently designed to decorate the `_get_similarity` method of the `AlignmentList` class. There are three expected calling modes: 1. decorated_func("connect") : Initializes the Connection and Cursor objects of the database 2. decorated_func(seq1, seq2, locus_length) : The typical main execution of the method, providing the sequence1 and sequence2 strings along with an integer with their length 3. decorated_func("disconnect") : Commits changes to database and closes Connection and Cursor objects. The path of this sqlite database is automatically obtained from the `AlignmentList` attribute `sql_path`. We use the path to the same directory, but with a different name for the database, "pw.db". Parameters ---------- args : list List of positional arguments for decorated method Notes ----- When the decorated method is called normally (i.e., not with a "connect" or "disconnect" argument), it first creates a hash of the argument list (excluding `self`). This hash is used to query the database to check if a value for that combination of sequence strings and sequence length has already been calculated. If yes, the result is immediately returned without executing the decorated method. If no, the method is executed to perform calculations. The result of this calculation is then stored in the database and the result is returned. """ c_args = args[1:3] # Initialize database connection. 
if c_args[0] == "connect": # Get path to database based on the AlignmentList db tmp_dir = os.path.dirname(args[0].sql_path) self.con = sqlite3.connect(join(tmp_dir, "pw.db")) self.c = self.con.cursor() self.c.execute("PRAGMA synchronous = OFF") # Create table, if it does not yet exist if not self.c.execute("SELECT name FROM sqlite_master WHERE type=" "'table' AND name='pw_table'").fetchall(): self.c.execute("CREATE TABLE pw_table (hash integer" ", seq_len integer" ", val float, " "effective_len float," "PRIMARY KEY (hash, seq_len))") return elif c_args[0] == "disconnect": self.con.commit() self.con.close() self.c = None return if self.c: h = hash(c_args) self.c.execute("SELECT * FROM pw_table WHERE hash = ?" " AND seq_len = ?", (h, args[-1])) vals = self.c.fetchone() if vals: return vals[2], vals[3] else: value = self.func(*args) h = hash(c_args) self.c.execute("INSERT INTO pw_table VALUES (?, ?, ?, ?)", (h, args[-1], value[0], value[1])) return value else: return self.func(*args)
def __repr__(self): """ Return the function's docstring. """ return self.func.__doc__ def __get__(self, obj, objtype): """ Support instance methods """ return functools.partial(self.__call__, obj)
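The connect/compute/disconnect protocol described above can be reduced to a short standalone sketch. The names here (`LookupSketch`, `similarity`) and the in-memory database are illustrative assumptions, not TriFusion's actual API, but the caching logic mirrors the decorator's:

```python
import sqlite3

class LookupSketch(object):
    """Hypothetical, reduced version of `LookupDatabase`: memoizes an
    expensive pairwise computation in an sqlite table keyed by a hash of
    the two sequences plus their length."""

    def __init__(self, func):
        self.func = func
        self.con = None
        self.c = None

    def __call__(self, *args):
        if args[0] == "connect":
            # TriFusion stores this next to `sql_path` as "pw.db"; an
            # in-memory database is used here for illustration.
            self.con = sqlite3.connect(":memory:")
            self.c = self.con.cursor()
            self.c.execute("CREATE TABLE pw_table (hash INTEGER,"
                           " seq_len INTEGER, val REAL,"
                           " PRIMARY KEY (hash, seq_len))")
            return
        if args[0] == "disconnect":
            self.con.commit()
            self.con.close()
            self.c = None
            return
        # We decorate a plain function here, so the hash covers args[:2]
        # (the real decorator skips `self` and hashes args[1:3]).
        h = hash(args[:2])
        row = self.c.execute("SELECT val FROM pw_table WHERE hash = ?"
                             " AND seq_len = ?", (h, args[-1])).fetchone()
        if row:
            return row[0]
        value = self.func(*args)
        self.c.execute("INSERT INTO pw_table VALUES (?, ?, ?)",
                       (h, args[-1], value))
        return value

@LookupSketch
def similarity(seq1, seq2, seq_len):
    # Proportion of identical positions between the two sequences.
    return sum(a == b for a, b in zip(seq1, seq2)) / float(seq_len)

similarity("connect")
print(similarity("ACGT", "ACGA", 4))  # computed and stored: 0.75
print(similarity("ACGT", "ACGA", 4))  # fetched from the table: 0.75
similarity("disconnect")
```

The second call never executes the decorated function; it is served from the `pw_table` lookup, which is the point of the decorator.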
def check_data(func): """Decorator handling the result from AlignmentList plotting methods. This should decorate all plotting methods from the `AlignmentList` class. It can be used to control warnings and handle exceptions when calling the plotting methods. Currently, it ensures that a list of methods are not executed when the `AlignmentList` instance contains a single alignment, and handles cases where the plotting methods return an empty array of data. Further control can be added here to prevent methods from being executed in certain conditions and handling certain outputs of the plotting methods. The only requirement is that, even when the plotting methods are not executed, this should always return a dictionary. In such cases, this dictionary should contain a single key:value with {"exception":<exception_type_string>}. The <exception_type_string> should be a simple string with an informative name that will be handled in the `stats_write_plot` of the `TriFusionApp` class. Parameters ---------- func : function Decorated function Returns ------- res : dict The value returned by `func` is always a dictionary. If the decorator prevents the execution of `func` for some reason, ensure that a dictionary is always returned with a single key:value {"exception": <exception_type_string>} """ # functools.wraps(func) allows the correct generation of docstrings # by sphinx of the decorated objects. 
@functools.wraps(func) def wrapper(*args, **kwargs): # Plot methods that should not be allowed to continue with only one # active alignment no_single_plot = ["outlier_missing_data", "outlier_missing_data_sp", "outlier_segregating", "outlier_segregating_sp", "outlier_sequence_size", "outlier_sequence_size_sp", "average_seqsize_per_species", "average_seqsize", "sequence_similarity", "sequence_segregation", "length_polymorphism_correlation", "taxa_distribution", "cumulative_missing_genes", "gene_occupancy", "missing_data_distribution", "missing_genes_average"] # Calling an outlier method with a single alignment should immediately # raise an exception if len(args[0].alignments) == 1: if func.__name__ in no_single_plot: return {"exception": "single_alignment"} res = func(*args, **kwargs) if "data" in res: if np.asarray(res["data"]).any(): return res else: return {"exception": "empty_data"} elif "exception" in res: return res return wrapper
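Stripped of the plotting specifics, the contract enforced by `check_data` can be sketched as below. The names (`check_data_sketch`, `fake_plot`) are hypothetical, and the builtin `any()` stands in for the `np.asarray(...).any()` emptiness test used by the real wrapper:

```python
import functools

def check_data_sketch(func):
    """Hypothetical reduction of `check_data`: the decorated method must
    always yield a dict, and an empty data payload is converted into an
    {"exception": ...} dict."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        res = func(*args, **kwargs)
        if "exception" in res:
            # The method already flagged a problem; pass it through.
            return res
        # Real code: np.asarray(res["data"]).any(); builtin any() is the
        # stand-in here for flat lists.
        if "data" in res and any(res["data"]):
            return res
        return {"exception": "empty_data"}
    return wrapper

@check_data_sketch
def fake_plot(values):
    return {"data": values}

print(fake_plot([1, 2, 3]))  # {'data': [1, 2, 3]}
print(fake_plot([]))         # {'exception': 'empty_data'}
```

Callers such as `stats_write_plot` then only have to branch on the presence of the `"exception"` key.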
def setup_database(func): """Decorator handling the active database tables. Decorates methods from the `Alignment` object that use and perform modifications to the original alignment data. All methods decorated with this must have the keyword arguments `table_in` and `table_out` (and `use_main_table`, optionally). The values associated with these arguments will determine which tables will be used and created before the execution of the decorated method (See Notes for the rationale). The strings provided as arguments for `table_in` and `table_out` will serve as a suffix to the `Alignment` instance attribute `table_name`. If `use_main_table` is provided, and is set to True, both `table_in` and `table_out` will default to `Alignment.table_name`. Values provided for `table_in` and `table_out` at calling time will be ignored. The following cases assume `use_main_table` is not provided or set to False. If both `table_in` and `table_out` are provided, `table_out` is modified so that `table_out = Alignment.table_name + table_out`. If `table_out` is not provided, it defaults to `Alignment.table_name`. If the final `table_out` does not exist in the database, create it. In this case, `table_in` will default to `Alignment.table_name`. If `table_out` already exists in the database, then `table_in=table_out`. Parameters ---------- func : function Decorated function Notes ----- In order to fully understand the mechanism behind the setup of the database tables, one must first know that when an `Alignment` instance is created, it generates a table in the database containing the original data from the alignment. In TriFusion (GUI), this original table MUST NOT be modified, since users may want to execute several methods on the same `Alignment` object. Therefore, when a particular method needs to modify the original alignment, a new temporary table is created to store the modified version until the end of the execution. 
If a second modification is requested, it will also be necessary to set the input table as the output table of the previous alignment modification. In the case of TriSeq (CLI), there is no such requirement, so it's much simpler to use the same table for all modifications. This decorator greatly simplifies this process in the same way for all methods of the `Alignment` object that modify the original alignment data. To accomplish this, all decorated methods must have the keyword arguments: `table_in` and `table_out` (and `use_main_table`, optionally). For methods called within the execution of TriFusion, the idea is simple. Since we don't know which methods will be used by the user, all chained methods that will create the same output file can be called with `table_in=new_table` and `table_out=new_table`. Note that both arguments have the same value. Whichever method is called first will find that "new_table" does not yet exist. In this case, the decorator will create a "new_table" in the database and reset `table_in` to the original `Alignment.table_name`. This ensures that the first method will still be able to fetch the alignment data. In the following methods, `table_in` and `table_out` will be used based on the original values, ensuring that the alignment data is being fetched from the last modification made and that the modification chain is maintained. In this way, the execution of `Alignment` methods can be the same, regardless of the order or number of operations requested by the user. In the case of TriSeq, the methods should set `use_main_table=True`. This will always set `table_in=table_out=Alignment.table_name`, which means that the main database table will be used for all methods. 
""" @functools.wraps(func) def wrapper(*args, **kwargs): # If use_main_table is set, the table_in will be overwritten by # table_out if "use_main_table" in kwargs: if kwargs["use_main_table"]: kwargs["table_out"] = args[0].table_name kwargs["table_in"] = args[0].table_name return func(*args, **kwargs) # Get sqlite database cursor and main table name sql_cur = args[0].cur # Define sqlite table name based on the main Alignment table name # and the provided table_name argument if "table_out" in kwargs: table_out = args[0].table_name + kwargs["table_out"] kwargs["table_out"] = table_out else: table_out = args[0].table_name kwargs["table_out"] = args[0].table_name # Add table name to active table names if table_out not in args[0].tables: args[0].tables.append(table_out) # Check if the table already exists, if not create it. if not sql_cur.execute( "SELECT name FROM sqlite_master WHERE type='table' AND" " name='{}'".format(table_out)).fetchall(): sql_cur.execute("CREATE TABLE [{}](" "txId INT," "taxon TEXT," "seq TEXT)".format(table_out)) # Set the table_in argument (table where the sequences will be # fecthed) to the main Alignment table name kwargs["table_in"] = args[0].table_name else: # Since the table_out already exists, sequences will be fetched # from this table instead of the main table name kwargs["table_in"] = table_out return func(*args, **kwargs) return wrapper
def setup_intable(func): """Decorator handling active database table when fetching data. This decorator is meant to decorate methods of the `Alignment` class that retrieve alignment data from the database. The requirement is that these methods have the positional arguments `table_suffix`, `table_name` and `table`. These are all optional, and only `table_suffix` and `table_name` should be used when calling these methods. The values of these two will be used to define the value of `table`, so any value provided to this argument when calling the decorated method will be ignored. In the end, the `table` variable will contain the final table name and only this variable will be used to retrieve the alignment data. `table_suffix` is a suffix that is appended to `Alignment.table_name`. `table_name` defines a complete table name. If neither `table_suffix` nor `table_name` are provided, then `table` will default to `Alignment.table_name`. If `table_name` is provided, `table=table_name`. If `table_suffix` is provided, `table=Alignment.table_name + table_suffix`. If both are provided, `table_name` takes precedence over `table_suffix` and `table=table_name`. In any case, we test the existence of the final `table` value in the database. If it does not exist, `table` will revert to `Alignment.table_name` to prevent errors. Parameters ---------- func : function Decorated function Attributes ---------- func : function Decorated function """ # functools.wraps(func) allows the correct generation of docstrings # by sphinx of the decorated objects. 
@functools.wraps(func) def wrapper(*args, **kwargs): # Get sqlite database cursor and main table name sql_cur = args[0].cur if "table_name" not in kwargs and "table_suffix" not in kwargs: kwargs["table"] = args[0].table_name return func(*args, **kwargs) try: tb_name = kwargs["table_name"] except KeyError: tb_name = None try: tb_suffix = kwargs["table_suffix"] except KeyError: tb_suffix = "" if tb_name: table = tb_name elif tb_suffix: table = args[0].table_name + tb_suffix else: table = args[0].table_name # Check if the specified table exists. If not, revert to # self.table_name. When executing Process operations, the first valid # operation may be one that only has a "table_in" argument but several # other operations may be defined before with a different table suffix. # In such case, the table passed to "table_in" was not yet created and # populated in those prior methods. Therefore, here we account for # those cases by reverting the table to self.table_name if table_in # was not yet created. if not sql_cur.execute( "SELECT name FROM sqlite_master WHERE type='table' AND" " name='{}'".format(table)).fetchall(): table = args[0].table_name kwargs["table"] = table return func(*args, **kwargs) return wrapper
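The precedence rules for choosing the read table can likewise be condensed into a few lines. `resolve_read_table` is an illustrative stand-in for the logic the decorator applies before calling the wrapped method:

```python
def resolve_read_table(main_table, existing_tables, table_suffix="",
                       table_name=None):
    """Sketch of setup_intable's precedence rules (illustrative names)."""
    if table_name:
        # A full table name always wins over a suffix.
        table = table_name
    elif table_suffix:
        table = main_table + table_suffix
    else:
        table = main_table
    # Revert to the master table if the resolved table was never created.
    return table if table in existing_tables else main_table

# A suffix resolves only if the derived table exists; otherwise fall back.
print(resolve_read_table("master", ["master"], table_suffix="_f"))
print(resolve_read_table("master", ["master", "master_f"], table_suffix="_f"))
```

The fallback to the master table is what lets a Process chain call data-fetching methods with a `table_suffix` that earlier (skipped) operations never created.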
class AlignmentException(Exception): """ Generic Alignment object exception. """ pass
class AlignmentUnequalLength(Exception): """ Raised when sequences in alignment have unequal length. """ pass
class Alignment(Base): """Main interface for single alignment files. The `Alignment` class is the main interface for single alignment files, providing methods that parse, query, retrieve and modify alignment data. It is usually created within the context of `AlignmentList`, which handles the connection with the sqlite database. Nevertheless, `Alignment` instances can be created by providing the `input_alignment` argument, which can be one of the following: - path to alignment file. In this case, the file is parsed and information stored in a new sqlite table. - sqlite table name. In this case, the table name must be present in the database. In either case, the sqlite `Connection` and `Cursor` objects should be provided. When an `Alignment` object is instantiated, it first generates the `table_name` based on the `input_alignment` string, filtering all characters that are not alphanumeric. Then, it queries the database to check if a table already exists with that name. If yes, it is assumed that the alignment data is stored in the provided table name. If there is no table with that name, it is assumed that `input_alignment` is a path to the alignment file and the regular parsing ensues. An empty table is created, the sequence type, format and missing data symbol are automatically detected and the alignment is parsed according to the detected format. Parameters ---------- input_alignment : str Can be either the path to an alignment file or the name of a table in the sqlite database. sql_cursor : sqlite3.Cursor Cursor object of the sqlite database. sql_con : sqlite3.Connection Connection object of the sqlite database. input_format : str, optional File format of `input_alignment`. If `input_alignment` is a file path, the format will be automatically detected from the file. The value provided with this argument overrides the automatic detection's result. partitions : `trifusion.process.data.Partitions`, optional If provided, it will set the `partitions` attribute. 
This should be used only when `input_alignment` is a database table name. locus_length : int, optional Sets the length of the current alignment, stored in the `locus_length` attribute. This option should only be used when `input_alignment` is a database table name. Otherwise, it is automatically set during alignment parsing. sequence_code : tuple, optional Sets the `sequence_code` attribute with the information on (<sequence_type>, <missing data symbol>). This option should only be used when `input_alignment` is a database table name. Otherwise, it is automatically set during alignment parsing. taxa_list : list, optional Sets the list attribute `taxa_list` with the names of the taxa present in the alignment. This option should only be used when `input_alignment` is a database table name. Otherwise, it is automatically set during alignment parsing. taxa_idx : dict, optional Sets the dictionary attribute `taxa_idx` that maps the taxon names to their index in the sqlite database table. This option should only be used when `input_alignment` is a database table name. Otherwise, it is automatically set during alignment parsing. Attributes ---------- cur : sqlite3.Cursor Cursor object of the sqlite database. con : sqlite3.Connection Connection object of the sqlite database. table_name : str Name of the sqlite database's table storing the sequence data. tables : list Lists the sqlite database's tables created for the `Alignment` instance. The "master" table is always present and represents the original alignment. partitions : `trifusion.process.data.Partitions` Stores the partitions and substitution model definition for the alignment. locus_length : int Length of the alignment in base pairs or residues. restriction_range : str Only used when gaps are coded in a binary matrix. Stores a string with the range of the restriction-type data that will encode gaps and will only be used when nexus is in the output format. 
e : None or Exception Stores any exceptions that occur during the parsing of the alignment file. It remains None unless something wrong happens. taxa_list : list List with the active taxon names. taxa_idx : dict Maps the taxon names to their corresponding index in the sqlite database. The index is not retrieved from the position of the taxon in `taxa_list` to prevent messing up when taxa are removed from the `Alignment` object. shelved_taxa : list List of ignored taxon names. path : str Full path to alignment file. sname : str Basename of the alignment file without the extension. name : str Basename of the alignment file with the extension. sequence_code : tuple Contains information on (<sequence type>, <missing data symbol>), e.g. ("Protein", "x"). interleave_data : bool Attribute that is set to True when interleave data has been created for the alignment data. input_format : str Format of the input alignment file. Notes ----- The `Alignment` class is not meant to be used directly (although it is quite possible to do so if you handle the database connections beforehand). Instead, use the `AlignmentList` class, even if there is only one alignment. All `Alignment` methods are available from the `AlignmentList` and it is also possible to retrieve specific `Alignment` objects. The `Alignment` class was designed to be a lightweight, fast and powerful interface between alignment data and a set of manipulation and transformation methods. For performance and efficiency purposes, all alignment data is stored in a sqlite database that prevents the entire alignment from being loaded into memory. To facilitate the retrieval and iteration over the alignment data, several methods (`iter_columns`, `iter_sequences`, etc.) are available to handle the interface with the database and retrieve only the necessary information. 
See Also -------- AlignmentList AlignmentList.add_alignment_files AlignmentList.retrieve_alignment """ def __init__(self, input_alignment, input_format=None, partitions=None, locus_length=None, sequence_code=None, taxa_list=None, taxa_idx=None, sql_cursor=None, sql_con=None): self.cur = sql_cursor self.con = sql_con if isinstance(partitions, Partitions): self.partitions = partitions else: self.partitions = Partitions() """ Initializing a Partitions instance for the current alignment. By default, the Partitions instance will assume that the whole Alignment object is one single partition. However, if the current alignment is the result of a concatenation, or if a partitions file is provided, the partitions argument can be used. Partitions can be later changed using the set_partitions method. Substitution models objects are associated with each partition and they default to None """ if not locus_length: self.locus_length = 0 else: self.locus_length = locus_length """ The length of the alignment object. Even if the current alignment object is partitioned, this will return the length of the entire alignment """ self.restriction_range = None """ This option is only relevant when gaps are coded. This will store a string with the range of the restriction-type data that will encode gaps and will only be used when nexus is in the output format """ self.e = None """ The `e` attribute will store any exceptions that occur during the parsing of the alignment object. It remains None unless something wrong happens. """ self.taxa_list = None """ Attribute that will store the taxa names. """ self.shelved_taxa = [] """ Attribute that will store shelved taxa. When retrieving taxa from the database, this list will be checked and present taxa will be ignored """ self.taxa_idx = None """ Attribute that will store the index of the taxa in the sql database. 
This index is stored in a dictionary instead of being retrieved from the index position of the taxa_list list because taxa may be removed (which would mess up the link between the two indexes) """ self.path = input_alignment """ Attribute with full path to alignment file """ self.sname = basename(splitext(input_alignment)[0]) """ Attribute with basename of alignment file without extension """ self.name = basename(input_alignment) """ Attribute with basename of alignment file """ self.sequence_code = sequence_code if taxa_list: self.taxa_list = taxa_list else: self.taxa_list = [] if taxa_idx: self.taxa_idx = taxa_idx else: self.taxa_idx = {} self.table_name = "".join([x for x in self.path if x.isalnum()]) """ Creates a database table for the current alignment. This will be the link attribute between the sqlite database and the remaining methods that require information on the sequence data. This also removes any non-alphanumeric characters the table name might have and ensures that it starts with an alphabetic character to avoid a database error """ self.tables = [] """ Lists the currently active tables for the Alignment object. The 'master' table is always present and represents the original alignment. Additional tables may be added as needed, and then dropped when no longer necessary. When all tables, except the master, need to be removed, this attribute can be used to quickly drop all derived tables """ self.interleave_data = False """ Attribute that is set to True when interleave data has been created for the alignment data. Building the interleave matrix is a bit costly, so when it is requested by the user it is built once and stored in the database, and then further usages will use that table. """ """ NOTE ON POSSIBLE DUPLICATE TABLE NAMES: It is not the responsibility of the Alignment class to check on duplicate table names. 
If an existing table name is found in the database, the Alignment class assumes that the alignment has already been parsed and inserted into the database. Therefore, it will not parse the alignment file again. This usually happens when creating concatenated or reverse concatenated alignments, where the sequence data is inserted into the database before instantiating the Alignment class. IT IS THE RESPONSIBILITY OF THE AlignmentList CLASS AND OTHER WRAPPERS TO ASSESS THE VALIDITY OR NOT OF POTENTIAL DUPLICATE ALIGNMENT FILES. """ # Check if table for current alignment file already exists. If it # does not exist, it is assumed that input_alignment is the path to # the alignment file if not self._table_exists(self.table_name): # Get alignment format and code. Sequence code is a tuple of # (DNA, N) or (Protein, X) finder_content = self.autofinder(input_alignment) # Handles the case where the input format is invalid and # finder_content is an Exception if isinstance(finder_content, Exception) is False: self.input_format, self.sequence_code = finder_content # Create database table when there are no issues in the # recognition of the file format and sequence code self._create_table(self.table_name) # In case the input format is specified, overwrite the # attribute if input_format: self.input_format = input_format """ Format of the alignment file. """ # parsing the alignment self.read_alignment() else: # Setting the sequence code attribute for seq type checking # in AlignmentList self.sequence_code = None self.e = finder_content # In case there is a table for the provided input_alignment else: self.input_format = input_format def __iter__(self): """Iterator behavior for `Alignment`. Iterates over taxa and sequence data from the alignment. Taxon names in the `shelved_taxa` attribute are ignored. This will always retrieve the alignment data from the master database table, `table_name`. Yields ------ tx : str taxon name. seq : str sequence string. 
""" for tx, seq in self.cur.execute( "SELECT taxon,seq from [{}]".format( self.table_name)): if tx not in self.shelved_taxa: yield tx, seq
 def _create_table(self, table_name, cur=None): """Creates a new table in the database. Convenience method that creates a new table in the sqlite database. It accepts a custom Cursor object, which overrides the `cur` attribute. Parameters ---------- table_name : str Name of the table to be created. cur : sqlite3.Cursor, optional Custom Cursor object used to query the database. """ if not cur: cur = self.cur cur.execute("CREATE TABLE [{}](" "txId INT," "taxon TEXT," "seq TEXT)".format(table_name))
 def _table_exists(self, table_name, cur=None): """ Checks if a table exists in the database. Convenience method that checks if a table exists in the database. It accepts a custom Cursor object, which overrides the `cur` attribute. Parameters ---------- table_name : str Name of the table. cur : sqlite3.Cursor, optional Custom Cursor object used to query the database. Returns ------- res : list List with the results of a query for 'table' type with `table_name` name. Is empty when the table does not exist. Notes ----- This returns a list that will contain one item if the table exists in the database. If it doesn't exist, returns an empty list. Therefore, this can be used simply like:: if self._table_exists("my_table"): # Do stuff """ if not cur: cur = self.cur res = cur.execute( "SELECT name FROM sqlite_master WHERE type='table' AND" " name='{}'".format(table_name)).fetchall() return res
 def _set_format(self, input_format): """Manually set the input format of the `Alignment` object. Manually sets the input format associated with the `Alignment` object. Parameters ---------- input_format : str The input format. """ self.input_format = input_format
@staticmethod
 def _set_pipes(ns=None, pbar=None, total=None, msg=None): """Setup of progress indicators for both GUI and CLI task executions. This handles the setup of the objects responsible for updating the progress of a task's execution in both TriFusion (GUI) and TriSeq (CLI). At the beginning of any given task, these objects can be initialized by providing either the Namespace object (`ns`) in the case of TriFusion, or the ProgressBar object (`pbar`), in the case of TriSeq. Along with one of these objects, the expected `total` of the progress should also be provided. The `ns` and `pbar` objects are updated at each iteration of a given task, and the `total` is used to get a measure of the progress. Optionally, a message can also be provided for the Namespace object that will be used by TriFusion. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. pbar : ProgressBar A ProgressBar object used to log the progress of TriSeq execution. total : int Expected total of the task's progress. msg : str, optional A secondary message that appears in TriFusion's progress dialogs. Notes ----- The progress of any given task can be provided by either an `Alignment` or `AlignmentList` instance. Generally, the tasks follow the progress of the `AlignmentList` instance, unless that instance contains only one `Alignment` object. In that case, the progress information is piped from the `Alignment` instance. For that reason, and for the Namespace (`ns`) object only, this method first checks if the `ns.sa` attribute exists. If it does, it means that the parent `AlignmentList` instance contains only one `Alignment` object and, therefore, the progress information should come from here. 
Examples -------- Start a progress counter for a task that will make 100 iterations:: self._set_pipes(ns=ns_obj, pbar=pbar_obj, total=100, msg="Some message") See Also -------- _reset_pipes _update_pipes """ # Check if the `sa` attribute exists in the Namespace object. # If yes, then the logging will be performed here. If not, # The progress attributes will not be set for `ns`. try: sa = ns.sa except AttributeError: sa = True if ns: # Listen for kill switch if ns.stop: raise KillByUser("") # Set attributes only if parent AlignmentList contains a single # Alignment object if sa: ns.total = total ns.counter = 0 ns.msg = msg if pbar: pbar.max_value = total # Reset ProgressBar object counter pbar.update(0)
@staticmethod
 def _update_pipes(ns=None, pbar=None, value=None, msg=None): """Update progress indicators for both GUI and CLI task executions. This method provides a single interface for updating the progress objects `ns` or `pbar`, which should have been initialized at the beginning of the task with the `_set_pipes` method. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. pbar : ProgressBar A ProgressBar object used to log the progress of TriSeq execution. value : int Value of the current progress index msg : str, optional A secondary message that appears in TriFusion's progress dialogs. Notes ----- The progress of any given task can be provided by either an `Alignment` or `AlignmentList` instance. Generally, the tasks follow the progress of the `AlignmentList` instance, unless that instance contains only one `Alignment` object. In that case, the progress information is piped from the `Alignment` instance. For that reason, and for the Namespace (`ns`) object only, this method first checks if the `ns.sa` attribute exists. If it does, it means that the parent `AlignmentList` instance contains only one `Alignment` object and, therefore, the progress information should come from here. Examples -------- Update the counter in an iteration of 100:: for i in range(100): self._update_pipes(ns=ns_obj, pbar=pbar_obj, value=i, msg="Some string") See Also -------- _set_pipes _reset_pipes """ try: sa = ns.sa except AttributeError: sa = True if ns: if ns.stop: raise KillByUser("") if sa: ns.counter = value ns.msg = msg if pbar: pbar.update(value)
@staticmethod
 def _reset_pipes(ns): """Reset progress indicators for both GUI and CLI task executions. This should be done at the end of any task that initialized the progress objects, but it only affects the Namespace object. It resets all Namespace attributes to None. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. See Also -------- _set_pipes _update_pipes """ try: sa = ns.sa except AttributeError: sa = True if ns: if ns.stop: raise KillByUser("") if sa: ns.total = ns.counter = ns.msg = ns.sa = None
@setup_intable
[docs] def iter_columns(self, table_suffix="", table_name=None, table=None):
    """Generator over alignment columns.

    Generator that yields the alignment columns in a list of characters.
    Sequence data is retrieved from a database table specified either by
    the `table_name` or `table_suffix` arguments. `table_name` will
    always take precedence over `table_suffix` if both are provided. If
    neither is provided, the default `Alignment.table_name` is used. If
    the table name provided by either `table_name` or `table_suffix` is
    invalid, the default `Alignment.table_name` is also used.

    The `table` variable is automatically set up by the `SetupInTable`
    decorator according to `table_suffix` and `table_name`, and should
    not be used directly (values provided when calling it are ignored).

    Parameters
    ----------
    table_suffix : string
        Suffix of the table from where the sequence data is fetched.
        The suffix is appended to `Alignment.table_name`.
    table_name : string
        Name of the table from where the sequence data is fetched.
    table : string
        This argument will be automatically set up by the `SetupInTable`
        decorator. Do not use directly.

    Yields
    ------
    i : list
        List with alignment column characters as elements
        (["A", "A", "C", "A", "A"]).

    See Also
    --------
    SetupInTable

    Notes
    -----
    To prevent the query from loading the entire alignment data into
    memory, two techniques are employed. First, the iteration over the
    query results is lazy, using `itertools.izip`. Second, a `range`
    iteration is performed to ensure that only blocks of 100k alignment
    sites are fetched at any given time. Several tests were performed to
    ensure that this block size provides a good trade-off between RAM
    usage and speed.

    All generators ignore data associated with taxa present in the
    `shelved_taxa` attribute.
    """

    try:
        # Locking mechanism necessary to avoid concurrency issues when
        # accessing the database. This ensures that only one Cursor
        # object is querying the database at any given time.
        lock.acquire(True)
        for p in range(0, self.locus_length, 100000):
            for i in itertools.izip(*(x[1] for x in self.cur.execute(
                    "SELECT taxon, substr(seq, {}, 100000)"
                    " FROM [{}]".format(p, table))
                    if x[0] not in self.shelved_taxa)):
                yield i
    finally:
        lock.release()
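The block-wise column iteration described in the Notes above can be sketched in a standalone, illustrative form (table name and data are hypothetical; Python 3's built-in `zip` stands in for the Python 2 `itertools.izip` used by the original). Sequences are fetched in `substr()` chunks and transposed into columns:

```python
import sqlite3

# Toy alignment table; schema mirrors the (txId, taxon, seq) layout.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE aln (txId INT, taxon TEXT, seq TEXT)")
con.executemany("INSERT INTO aln VALUES (?, ?, ?)",
                [(0, "spA", "acgt"), (1, "spB", "acga")])

def iter_columns(con, locus_length, block=2, shelved=()):
    # Fetch `block` sites per query; SQLite substr() is 1-indexed.
    for p in range(1, locus_length + 1, block):
        rows = con.execute(
            "SELECT taxon, substr(seq, ?, ?) FROM aln", (p, block))
        # Keep only non-shelved taxa, then transpose chunks into columns.
        chunks = [seq for taxon, seq in rows if taxon not in shelved]
        for col in zip(*chunks):
            yield list(col)

cols = list(iter_columns(con, 4))
# First column holds the first character of every taxon.
```

Only one block of sites per taxon is in memory at any time, which is the memory/speed trade-off the docstring refers to.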
@setup_intable
[docs] def iter_columns_uniq(self, table_suffix="", table_name=None,
                            table=None):
    """Generator over unique characters of an alignment column.

    Generator that yields the unique elements of alignment columns in a
    list. Sequence data is retrieved from a database table specified
    either by the `table_name` or `table_suffix` arguments. `table_name`
    will always take precedence over `table_suffix` if both are
    provided. If neither is provided, the default `Alignment.table_name`
    is used. If the table name provided by either `table_name` or
    `table_suffix` is invalid, the default `Alignment.table_name` is
    also used.

    The `table` variable is automatically set up by the `SetupInTable`
    decorator according to `table_suffix` and `table_name`, and should
    not be used directly (values provided when calling it are ignored).

    Parameters
    ----------
    table_suffix : string
        Suffix of the table from where the sequence data is fetched.
        The suffix is appended to `Alignment.table_name`.
    table_name : string
        Name of the table from where the sequence data is fetched.
    table : string
        This argument will be automatically set up by the `SetupInTable`
        decorator. Do not use directly.

    Yields
    ------
    i : list
        List with the unique alignment column characters as elements
        (["A", "C"]).

    See Also
    --------
    SetupInTable

    Notes
    -----
    To prevent the query from loading the entire alignment data into
    memory, two techniques are employed. First, the iteration over the
    query results is lazy, using `itertools.izip`. Second, a `range`
    iteration is performed to ensure that only blocks of 100k alignment
    sites are fetched at any given time. Several tests were performed to
    ensure that this block size provides a good trade-off between RAM
    usage and speed.

    All generators ignore data associated with taxa present in the
    `shelved_taxa` attribute.
    """

    try:
        # Locking mechanism necessary to avoid concurrency issues when
        # accessing the database. This ensures that only one Cursor
        # object is querying the database at any given time.
        lock.acquire(True)
        for p in range(0, self.locus_length, 100000):
            for i in itertools.izip(*(x[1] for x in self.cur.execute(
                    "SELECT taxon, substr(seq, {}, 100000)"
                    " FROM [{}]".format(p, table))
                    if x[0] not in self.shelved_taxa)):
                yield list(set(i))
    finally:
        lock.release()
@setup_intable
[docs] def iter_sequences(self, table_suffix="", table_name=None, table=None): """ Generator over sequence strings in the alignment. Generator for sequence data of the `Alignment` object. Sequence data is retrieved from a database table specified either by the `table_name` or `table_suffix` arguments. `table_name` will always take precedence over `table_suffix` if both are provided. If none are provided, the default `Alignment.table_name` is used. If the table name provided by either `table_name` or `table_suffix` is invalid, the default `Alignment.table_name` is also used. Parameters ---------- table_suffix : string Suffix of the table from where the sequence data is fetched. The suffix is appended to `Alignment.table_name`. table_name : string Name of the table from where the sequence data is fetched. table : string This argument will be automatically setup by the `SetupInTable` decorator. Do not use directly. Yields ------ seq : string Sequence string for a given taxon See Also -------- SetupInTable Notes ----- All generators ignore data associated with taxa present in the `shelved_taxa` attribute. """ try: # Locking mechanism necessary to avoid concurrency issues when # accessing the database. This ensures that only one Cursor # object is querying the database at any given time lock.acquire(True) for tx, seq in self.cur.execute( "SELECT taxon,seq FROM [{}]".format(table)): if tx not in self.shelved_taxa: yield seq finally: lock.release()
@setup_intable
[docs] def iter_alignment(self, table_suffix="", table_name=None,
                         table=None):
    """Generator for (taxon, sequence) tuples.

    Generator that yields (taxon, sequence) tuples from the database.
    Sequence data is retrieved from a database table specified either by
    the `table_name` or `table_suffix` arguments. `table_name` will
    always take precedence over `table_suffix` if both are provided. If
    neither is provided, the default `Alignment.table_name` is used. If
    the table name provided by either `table_name` or `table_suffix` is
    invalid, the default `Alignment.table_name` is also used.

    Parameters
    ----------
    table_suffix : string
        Suffix of the table from where the sequence data is fetched.
        The suffix is appended to `Alignment.table_name`.
    table_name : string
        Name of the table from where the sequence data is fetched.
    table : string
        This argument will be automatically set up by the `SetupInTable`
        decorator. Do not use directly.

    Yields
    ------
    tx : string
        Taxon name.
    seq : string
        Sequence string.

    See Also
    --------
    SetupInTable

    Notes
    -----
    All generators ignore data associated with taxa present in the
    `shelved_taxa` attribute.
    """

    try:
        # Locking mechanism necessary to avoid concurrency issues when
        # accessing the database. This ensures that only one Cursor
        # object is querying the database at any given time.
        lock.acquire(True)
        for tx, seq in self.cur.execute(
                "SELECT taxon, seq FROM [{}]".format(table)):
            if tx not in self.shelved_taxa:
                yield tx, seq
    finally:
        lock.release()
@setup_intable
[docs] def get_sequence(self, taxon, table_name=None, table_suffix="",
                       table=None):
    """Returns the sequence string for a given taxon.

    Returns the sequence string of the corresponding `taxon`. If the
    taxon does not exist in the table, raises a KeyError. The sequence
    is retrieved from a database table specified either by the
    `table_name` or `table_suffix` arguments. `table_name` will always
    take precedence over `table_suffix` if both are provided. If neither
    is provided, the default `Alignment.table_name` is used. If the
    table name provided by either `table_name` or `table_suffix` is
    invalid, the default `Alignment.table_name` is also used.

    Parameters
    ----------
    taxon : str
        Name of the taxon.
    table_suffix : string
        Suffix of the table from where the sequence data is fetched.
        The suffix is appended to `Alignment.table_name`.
    table_name : string
        Name of the table from where the sequence data is fetched.
    table : string
        This argument will be automatically set up by the `SetupInTable`
        decorator. Do not use directly.

    Returns
    -------
    seq : str
        Sequence string.

    Raises
    ------
    KeyError
        If the taxon does not exist in the specified table.

    Notes
    -----
    Ignores data associated with taxa present in the `shelved_taxa`
    attribute.
    """

    try:
        # Locking mechanism necessary to avoid concurrency issues when
        # accessing the database. This ensures that only one Cursor
        # object is querying the database at any given time.
        lock.acquire(True)
        if taxon in self.taxa_list and taxon not in self.shelved_taxa:
            seq = self.cur.execute(
                "SELECT seq FROM [{}] WHERE txId=?".format(table),
                (self.taxa_idx[taxon],)).fetchone()[0]
            return seq
        else:
            raise KeyError
    finally:
        lock.release()
[docs] def remove_alignment(self): """ Removes alignment data from the database. """ # Drop main alignment self.cur.execute("DROP TABLE [{}]".format(self.table_name)) # If additional alignments were created, drop those tables as well for table in self.tables: try: self.cur.execute("DROP TABLE [{}]".format(table)) except sqlite3.OperationalError: pass
[docs] def shelve_taxa(self, lst): """Shelves taxa from `Alignment` methods. The taxa provided in the `lst` argument will be ignored in subsequent operations. In practice, taxa from the `taxa_list` attribute that are present in `lst` will be moved to the `shelved_taxa` attribute and ignored in all methods of the class. To revert all taxa to the `taxa_list` attribute, simply call this method with an empty list. Parameters ---------- lst : list List with taxa names that should be ignored. """ self.shelved_taxa = lst self.taxa_list = [x for x in self.taxa_list if x not in self.shelved_taxa]
[docs] def _read_interleave_phylip(self, ntaxa):
    """Alignment parser for the interleaved phylip format.

    Parses an interleaved phylip alignment file and stores taxa and
    sequence data in the database. This method is only called from the
    `_read_phylip` method, when the regular phylip parser detects that
    the file is in interleave format.

    Parameters
    ----------
    ntaxa : int
        Number of taxa contained in the alignment file.

    Returns
    -------
    size_list : list
        List of sequence size (int) for each taxon in the alignment
        file. Used to check size consistency at the end of the alignment
        parsing in `read_alignment`.

    See Also
    --------
    _read_phylip

    Notes
    -----
    The phylip interleave format splits the alignment into blocks of a
    certain length (usually 90 characters) separated by blank lines.
    This means that to gather the complete sequence of a given taxon,
    the parser has to read the entire file. To prevent the entire
    alignment from being loaded into memory, we actually iterate over a
    range determined by the number of taxa in the alignment. In each
    iteration, we open a file handle and retrieve only the sequence of a
    particular taxon. This ensures that only sequence data for a single
    taxon is stored in memory at any given time. This also means that
    the alignment file has to be read N times, where N = number of taxa.
    However, since the vast majority of lines are actually skipped in
    each iteration, the decrease in speed is marginal, while the gains
    in memory efficiency are much larger.
    """

    size_list = []

    for i in xrange(ntaxa):

        # Variable that will store the sequence for the current taxon.
        sequence = []

        # Index identifier of the current taxon
        idx = 0

        # When True, it means that the alignment line has the taxon and
        # sequence. When False, it means that the line contains only
        # sequence.
        taxa_gather = True

        fh = open(self.path)

        # Skip header and any potential blank lines
        header = ""
        while not header:
            header = next(fh)

        for line in fh:

            # At blank lines, reset the idx and set taxa_gather to
            # False. All lines from this point on will only contain
            # sequence.
            if not line.strip():
                idx = 0
                taxa_gather = False
            else:
                # If the idx of the alignment block matches the taxon
                # from the current iteration, add to sequence variable
                if idx == i:
                    # Remove the taxon from the gathering
                    if taxa_gather:
                        sequence.append(
                            "".join(line.strip().lower().split()[1:]))
                    else:
                        sequence.append(
                            "".join(line.strip().lower().split()))
                idx += 1

        seq = "".join(sequence)
        taxa = self.taxa_list[i]

        size_list.append(len(seq))

        # Insert data into database
        self.cur.execute("INSERT INTO [{}] VALUES (?, ?, ?)".format(
            self.table_name), (i, taxa, seq))

        fh.close()

    return size_list
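The N-pass strategy described in the Notes above can be illustrated with a self-contained toy (the sample alignment and helper name are hypothetical, and `io.StringIO` stands in for the file handle): the "file" is re-read once per taxon, so only one taxon's sequence lives in memory at a time.

```python
import io

# Toy interleaved phylip body (header already skipped): the first block
# carries taxon names, later blocks carry sequence only.
INTERLEAVED = """\
spA acgt
spB acgc

ggta
ggtt
"""

def read_interleaved(text, ntaxa):
    sequences = {}
    for i in range(ntaxa):
        chunks, idx, taxa_gather = [], 0, True
        name = None
        fh = io.StringIO(text)
        for line in fh:
            if not line.strip():
                # Blank line: a new block starts, containing sequence only.
                idx, taxa_gather = 0, False
                continue
            if idx == i:
                fields = line.split()
                if taxa_gather:
                    name = fields[0]
                    chunks.append("".join(fields[1:]).lower())
                else:
                    chunks.append("".join(fields).lower())
            idx += 1
        sequences[name] = "".join(chunks)
    return sequences

seqs = read_interleaved(INTERLEAVED, 2)
```

Each of the `ntaxa` passes skips every line whose block index does not match the current taxon, which is why the repeated reads cost little time relative to the memory saved.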
[docs] def _read_interleave_nexus(self, ntaxa):
    """Alignment parser for the interleaved nexus format.

    Parses an interleaved nexus alignment file and stores taxa and
    sequence data in the database. This method is only called from the
    `_read_nexus` method, when the regular nexus parser detects that the
    file is in interleave format.

    Parameters
    ----------
    ntaxa : int
        Number of taxa contained in the alignment file.

    Returns
    -------
    size_list : list
        List of sequence size (int) for each taxon in the alignment
        file. Used to check size consistency at the end of the alignment
        parsing in `read_alignment`.

    See Also
    --------
    _read_nexus

    Notes
    -----
    The nexus interleave format splits the alignment into blocks of a
    certain length (usually 90 characters) separated by blank lines.
    This means that to gather the complete sequence of a given taxon,
    the parser has to read the entire file. To prevent the entire
    alignment from being loaded into memory, we actually iterate over a
    range determined by the number of taxa in the alignment. In each
    iteration, we open a file handle and retrieve only the sequence of a
    particular taxon. This ensures that only sequence data for a single
    taxon is stored in memory at any given time. This also means that
    the alignment file has to be read N times, where N = number of taxa.
    However, since the vast majority of lines are actually skipped in
    each iteration, the decrease in speed is marginal, while the gains
    in memory efficiency are much larger.
    """

    size_list = []

    for i in xrange(ntaxa):

        # Variable that will store the sequence for the current taxon.
        sequence = []

        # Counter used to skip the Nexus header and footer
        counter = 0

        # Index identifier of the current taxon
        idx = 0

        taxa = None

        fh = open(self.path)

        for line in fh:

            if line.strip().lower() == "matrix" and counter == 0:
                counter = 1
            elif line.strip() == ";" and counter == 1:
                counter = 2
            elif counter == 1:
                if not line.strip():
                    idx = 0
                else:
                    if idx == i:
                        if not taxa:
                            taxa = line.strip().split()[0].replace(" ", "")
                            taxa = self.rm_illegal(taxa)
                        sequence.append(
                            "".join(line.strip().lower().split()[1:]))
                    idx += 1

        self.taxa_list.append(taxa)
        self.taxa_idx[taxa] = i

        seq = "".join(sequence)

        if not self.locus_length:
            self.locus_length = len(seq)

        size_list.append(len(seq))

        self.cur.execute("INSERT INTO [{}] VALUES (?, ?, ?)".format(
            self.table_name), (i, taxa, seq))

        fh.close()

    return size_list
[docs] def _eval_missing_symbol(self, sequence): """Evaluates missing data symbol from sequence string. This method is performed when sequence data is being parsed from the alignment and only executes when the missing data symbol stored in `Alignment.sequence_code` is not defined. It attempts to count the regular characters used to denote missing data. First, tries to find "?" characters, which are set as the symbol if they exist. Then, it finds "n" characters and if sequence type is set to DNA, this character is set as the symbol. Finally, it finds "x" characters and if sequence type is set to Protein, this character is set as the symbol. Parameters ---------- sequence : str Sequence string """ if not self.sequence_code[1]: if sequence.count("?"): self.sequence_code[1] = "?" elif sequence.count("n") and self.sequence_code[0] == "DNA": self.sequence_code[1] = "n" elif sequence.count("x") and self.sequence_code[0] == "Protein": self.sequence_code[1] = "x"
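The heuristic in `_eval_missing_symbol` can be restated as a standalone function (illustrative; the real method mutates `self.sequence_code` in place rather than returning a value): "?" always wins, then "n" for DNA, then "x" for protein.

```python
def eval_missing_symbol(sequence, seq_type, current=None):
    """Infer the missing-data symbol from a sequence string.

    Mirrors the precedence used above: an already-set symbol is kept,
    otherwise "?" > "n" (DNA only) > "x" (Protein only).
    """
    if current:
        return current
    if "?" in sequence:
        return "?"
    if "n" in sequence and seq_type == "DNA":
        return "n"
    if "x" in sequence and seq_type == "Protein":
        return "x"
    # No candidate found; the caller falls back to a default later.
    return None
```

Note that `read_alignment` supplies the fallback defaults ("n" for DNA, "x" for Protein) when no symbol could be inferred during parsing.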
[docs] def _read_phylip(self):
    """Alignment parser for phylip format.

    Parses a phylip alignment file and stores taxa and sequence data in
    the database.

    Returns
    -------
    size_list : list
        List of sequence size (int) for each taxon in the alignment
        file. Used to check size consistency at the end of the alignment
        parsing in `read_alignment`.

    See Also
    --------
    read_alignment
    """

    fh = open(self.path)

    # Variable storing the length of each sequence
    size_list = []

    # Get the number of taxa and sequence length from the file header
    header = fh.readline().split()
    self.locus_length = int(header[1])
    self.partitions.set_length(self.locus_length)
    taxa_num = int(header[0])

    # The two following attributes allow the support for interleave
    # phylip files.
    # Flags whether the current line should be parsed for taxon name
    taxa_gather = True
    # Counter that makes the correspondence between the current line
    # and the appropriate taxon
    c = 0

    for line in fh:

        # Ignore empty lines
        if line.strip() == "":
            continue

        # The counter is reset when surpassing the number of expected
        # taxa.
        if c + 1 > taxa_num:
            c = 0
            taxa_gather = False

        try:

            # Here, assume that all lines with taxon names have already
            # been processed, since the counter already reached the
            # number of taxa expected from the phylip header
            if not taxa_gather:

                # Oh boy, this seems like an interleave phylip file.
                # Redirect parsing to the appropriate method
                self.cur.execute("DELETE FROM [{}]".format(
                    self.table_name))
                size_list = self._read_interleave_phylip(taxa_num)
                break

            # To support interleave phylip, while the counter has not
            # reached the expected number of taxa provided in header[0],
            # treat the line as one of the first lines of the phylip,
            # where the first field is the taxon name
            elif taxa_gather:

                taxa = line.split()[0].replace(" ", "")
                taxa = self.rm_illegal(taxa)

                try:
                    # Join multiple batches of sequence. Remove any
                    # possible whitespace by splitting according to
                    # whitespace. This also ensures that sequence is
                    # always a list when added to sequence_data
                    sequence = line.strip().lower().split()[1:]
                except IndexError:
                    sequence = [""]

                self.taxa_list.append(taxa)
                self.taxa_idx[taxa] = c

                seq = "".join(sequence)

                # Evaluate missing data symbol if undefined
                self._eval_missing_symbol(seq)

                self.cur.execute("INSERT INTO [{}] VALUES"
                                 " (?, ?, ?)".format(self.table_name),
                                 (c, taxa, seq))

                size_list.append(len(seq))

                # Add counter for interleave processing
                c += 1

        except IndexError:
            pass

    # Updating partitions object
    self.partitions.add_partition(self.name, self.locus_length,
                                  file_name=self.path)

    fh.close()

    return size_list
[docs] def _read_fasta(self):
    """Alignment parser for fasta format.

    Parses a fasta alignment file and stores taxa and sequence data in
    the database.

    Returns
    -------
    size_list : list
        List of sequence size (int) for each taxon in the alignment
        file. Used to check size consistency at the end of the alignment
        parsing in `read_alignment`.

    See Also
    --------
    read_alignment
    """

    fh = open(self.path)

    # Variable storing the length of each sequence
    size_list = []

    sequence = []
    taxa = None
    idx = 0

    for line in fh:
        if line.strip().startswith(">"):

            if sequence:
                seq = "".join(sequence)

                # Evaluate missing data symbol if undefined
                self._eval_missing_symbol(seq)

                self.cur.execute("INSERT INTO [{}] VALUES"
                                 " (?, ?, ?)".format(self.table_name),
                                 (idx, taxa, seq))
                self.taxa_list.append(taxa)
                self.taxa_idx[taxa] = idx

                if not self.locus_length:
                    self.locus_length = len(seq)

                size_list.append(len(seq))

                sequence = []
                idx += 1

            taxa = line[1:].strip()
            taxa = self.rm_illegal(taxa)

        elif line.strip() != "" and taxa:
            sequence.append(line.strip().lower().
                            replace(" ", "").replace("*", ""))

    if sequence:
        seq = "".join(sequence)

        # Evaluate missing data symbol if undefined
        self._eval_missing_symbol(seq)

        self.cur.execute("INSERT INTO [{}] VALUES"
                         " (?, ?, ?)".format(self.table_name),
                         (idx, taxa, seq))
        self.taxa_list.append(taxa)
        self.taxa_idx[taxa] = idx

        if not self.locus_length:
            self.locus_length = len(seq)

        size_list.append(len(seq))

    self.partitions.set_length(self.locus_length)

    # Updating partitions object
    self.partitions.add_partition(self.name, self.locus_length,
                                  file_name=self.path)

    fh.close()

    return size_list
[docs] def _read_loci(self):
    """Alignment parser for pyRAD and ipyrad loci format.

    Returns
    -------
    size_list : list
        List of sequence size (int) for each taxon in the alignment
        file. Used to check size consistency at the end of the alignment
        parsing in `read_alignment`.

    See Also
    --------
    read_alignment
    """

    fh = open(self.path)

    # Variable storing the length of each sequence
    size_list = []

    taxa_list = self.get_loci_taxa(self.path)

    # Create empty dict
    sequence_data = dict([(tx, []) for tx in taxa_list])

    # Add a counter to name each locus
    locus_c = 1

    # This variable is used in comparison with taxa_list to check which
    # taxa are missing from the current partition. Missing taxa are
    # filled with missing data.
    present_taxa = []

    for line in fh:
        if not line.strip().startswith("//") and line.strip() != "":
            fields = line.strip().split()
            taxon = fields[0].lstrip(">")
            present_taxa.append(taxon)
            sequence_data[taxon].append(fields[1].lower())

        elif line.strip().startswith("//"):

            locus_len = len(fields[1])
            self.locus_length += locus_len
            size_list.append(locus_len)

            # Adding missing data
            for tx in taxa_list:
                if tx not in present_taxa:
                    sequence_data[tx].append(self.sequence_code[1] *
                                             locus_len)

            present_taxa = []

            self.partitions.add_partition("locus_{}".format(locus_c),
                                          locus_len,
                                          file_name=self.path)
            locus_c += 1

    sequence_data = [(p, tx, "".join(seq)) for p, (tx, seq) in
                     enumerate(sorted(sequence_data.items()))]

    self.cur.executemany("INSERT INTO [{}] VALUES (?, ?, ?)".format(
        self.table_name), sequence_data)

    self.partitions.set_length(self.locus_length)

    fh.close()

    return size_list
[docs] def _read_nexus(self):
    """Alignment parser for nexus format.

    Parses a nexus alignment file and stores taxa and sequence data in
    the database.

    Returns
    -------
    size_list : list
        List of sequence size (int) for each taxon in the alignment
        file. Used to check size consistency at the end of the alignment
        parsing in `read_alignment`.

    See Also
    --------
    read_alignment
    """

    fh = open(self.path)

    # Variable storing the length of each sequence
    size_list = []

    counter = 0
    idx = 0
    ntaxa = None
    interleave = None

    for line in fh:

        # Fetch the number of taxa from the nexus header. This will be
        # necessary for efficient interleave parsing
        if not ntaxa:
            try:
                ntaxa = int(re.search(r".*ntax=(.+?) ", line).group(1))
            except ValueError:
                self.e = InputError("Could not recognize number"
                                    " of taxa from nexus header")
                return
            except AttributeError:
                pass

        # Fetch the number of sites from the nexus header. This will be
        # necessary for efficient interleave parsing
        if not self.locus_length:
            try:
                self.locus_length = int(
                    re.search(r".*nchar=(.+?)[ ,;]", line).group(1))
            except ValueError:
                self.e = InputError("Could not recognize number"
                                    " of sites from nexus header")
                return
            except AttributeError:
                pass

        # Fetch the interleave value from the nexus header. This will
        # be necessary for efficient interleave parsing
        if interleave is None:
            try:
                interleave = re.search(
                    r"interleave=(.+?) ", line).group(1).lower()
                interleave = True if interleave.strip() == "yes" \
                    else False
            except AttributeError:
                pass

        # Set the counter to begin data parsing
        if line.strip().lower() == "matrix" and counter == 0:
            counter = 1

        # Stop sequence parser here
        elif line.strip() == ";" and counter == 1:
            counter = 2
            self.partitions.set_length(self.locus_length)

        # Start parsing here
        elif counter == 1:

            # If neither ntaxa nor interleave is defined, return with
            # an error in the 'e' attribute
            if not ntaxa or interleave is None:
                self.e = InputError("Could not recognize ntax or"
                                    " interleave parameters from"
                                    " nexus header")
                return

            if interleave is False:

                # In interleave, this marks the change in matrix blocks
                if not line.strip():
                    continue

                # Get taxon name
                taxa = line.strip().split()[0].replace(" ", "")
                taxa = self.rm_illegal(taxa)

                # Update taxa_list and taxa_idx attributes
                self.taxa_list.append(taxa)
                self.taxa_idx[taxa] = idx

                # Get sequence string
                seq = "".join(line.strip().split()[1].split())

                # Evaluate missing data symbol if undefined
                self._eval_missing_symbol(seq)

                size_list.append(len(seq))

                # Add sequence to sqlite database
                self.cur.execute(
                    "INSERT INTO [{}] VALUES (?, ?, ?)".format(
                        self.table_name), (idx, taxa, seq))

                idx += 1

            else:
                self._read_interleave_nexus(ntaxa)
                counter = 2

        # If partitions are specified using the charset command, this
        # section will parse the partitions
        elif line.strip().startswith("charset"):
            self.partitions.read_from_nexus_string(line,
                                                   file_name=self.path)

        # If substitution models are specified using the lset or prset
        # commands, this will parse the model parameters
        if ("lset" in line.lower() or "prset" in line.lower()) and \
                counter == 2:
            self.partitions.parse_nexus_model(line)

    # If no partitions have been added during the parsing of the nexus
    # file, set a single partition
    if self.partitions.partitions == OrderedDict():
        self.partitions.add_partition(self.name, self.locus_length,
                                      file_name=self.path)

    fh.close()

    return size_list
[docs] def _read_stockholm(self):
    """Alignment parser for stockholm format.

    Parses a stockholm alignment file and stores taxa and sequence data
    in the database.

    Returns
    -------
    size_list : list
        List of sequence size (int) for each taxon in the alignment
        file. Used to check size consistency at the end of the alignment
        parsing in `read_alignment`.

    See Also
    --------
    read_alignment
    """

    fh = open(self.path)

    # Variable storing the length of each sequence
    size_list = []

    sequence_data = []

    # Skip first header line
    next(fh)

    for line in fh:
        # Skip header and comments
        if line.startswith("#"):
            pass

        # End of file
        elif line.strip() == "//":
            break

        # Parse sequence data
        elif line.strip() != "":

            taxa = line.split()[0].split("/")[0]
            taxa = self.rm_illegal(taxa)

            try:
                sequence = line.split()[1].strip().lower()
                # Evaluate missing data symbol if undefined
                self._eval_missing_symbol(sequence)
            except IndexError:
                sequence = ""

            size_list.append(len(sequence))

            sequence_data.append((taxa, sequence))

    sequence_data = [(p, x, y) for p, (x, y) in
                     enumerate(sorted(sequence_data))]

    self.cur.executemany("INSERT INTO [{}] VALUES (?, ?, ?)".format(
        self.table_name), sequence_data)

    self.locus_length = len(sequence_data[0][1])
    self.partitions.set_length(self.locus_length)

    # Updating partitions object
    self.partitions.add_partition(self.name, self.locus_length,
                                  file_name=self.path)

    fh.close()

    return size_list
[docs] def read_alignment(self, size_check=True): """Main alignment parser method. This is the main alignment parsing method that is called when the `Alignment` object is instantiated with a file path as the argument. Given the alignment format automatically detected in `__init__`, it calls the specific method that parses that alignment format. After the execution of this method, all attributes of the class will be set and the full range of methods can be applied. Parameters ---------- size_check : bool, optional If True, perform a check for sequence size consistency across taxa (default is True). """ parsing_methods = { "phylip": self._read_phylip, "fasta": self._read_fasta, "loci": self._read_loci, "nexus": self._read_nexus, "stockholm": self._read_stockholm } size_list = parsing_methods[self.input_format]() # If the missing data symbol could not be evaluated during alignment # parsing, set the defaults default_missing = {"DNA": "n", "Protein": "x"} if not self.sequence_code[1]: self.sequence_code[1] = default_missing[self.sequence_code[0]] # Checks the size consistency of the alignment if size_check is True: if len(set(size_list)) > 1: self.e = AlignmentUnequalLength() return if len(self.taxa_list) != len(set(self.taxa_list)): duplicate_taxa = self.duplicate_taxa(self.taxa_list) self.e = DuplicateTaxa("The following taxa were duplicated in" " the alignment: {}".format( "; ".join(duplicate_taxa)))
[docs] def remove_taxa(self, taxa_list_file, mode="remove"):
    """Removes taxa from the `Alignment` object.

    Removes taxa based on the `taxa_list_file` argument from the
    `Alignment` object. This method supports a list or a path to a CSV
    file as the `taxa_list_file` argument. In the case of a CSV path,
    it parses the CSV file into a list. The CSV file should be a simple
    text file containing taxon names in each line.

    This method also supports two removal modes.

    - remove: removes the specified taxa
    - inverse: removes all but the specified taxa

    Parameters
    ----------
    taxa_list_file : list or string
        List containing taxa names or string with path to a CSV file
        containing the taxa names.
    mode : {"remove", "inverse"}, optional
        The removal mode (default is remove).
    """

    def remove(list_taxa):
        for tx in list_taxa:
            self.cur.execute(
                "DELETE FROM [{}] WHERE txId=?".format(self.table_name),
                (self.taxa_idx[tx],))
            self.taxa_list.remove(tx)
            del self.taxa_idx[tx]

    def inverse(list_taxa):
        for tx in list(self.taxa_list):
            if tx not in list_taxa:
                self.cur.execute(
                    "DELETE FROM [{}] WHERE txId=?".format(
                        self.table_name), (self.taxa_idx[tx],))
                self.taxa_list.remove(tx)
                del self.taxa_idx[tx]

    # Checking if taxa_list is an input csv file:
    try:
        file_handle = open(taxa_list_file[0])
        taxa_list = self.read_basic_csv(file_handle)

    # If not, then the method's argument is already the final list
    except (IOError, IndexError):
        taxa_list = taxa_list_file

    # Filter taxa_list for the taxa that are actually present in this
    # Alignment object
    taxa_list = [x for x in taxa_list if x in self.taxa_list]

    if mode == "remove":
        remove(taxa_list)
    if mode == "inverse":
        inverse(taxa_list)
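The two removal modes can be illustrated on a plain list (hypothetical taxon names; the real method also deletes the corresponding database rows):

```python
# "remove" drops the listed taxa; "inverse" keeps only the listed taxa.
taxa = ["spA", "spB", "spC", "spD"]
targets = ["spB", "spD"]

removed = [tx for tx in taxa if tx not in targets]   # "remove" mode
kept = [tx for tx in taxa if tx in targets]          # "inverse" mode
```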
[docs] def change_taxon_name(self, old_name, new_name): """Changes the name of a particular taxon. Parameters ---------- old_name : str Original taxon name. new_name : str New taxon name. """ # Change in taxa_list if old_name in self.taxa_list: self.taxa_list[self.taxa_list.index(old_name)] = new_name # Change in the database self.cur.execute("UPDATE [{}] SET taxon=? WHERE txId=?".format( self.table_name), (new_name, self.taxa_idx[old_name])) # Change in taxa_index self.taxa_idx[new_name] = self.taxa_idx[old_name] del self.taxa_idx[old_name]
@setup_database
[docs] def collapse(self, write_haplotypes=True, haplotypes_file=None,
                   dest=".", conversion_suffix="", haplotype_name="Hap",
                   table_in=None, table_out="collapsed",
                   use_main_table=False, ns=None, pbar=None):
    """Collapses equal sequences into unique haplotypes.

    Collapses equal sequences into unique haplotypes. This method
    fetches the sequences for the current alignment and creates a new
    database table with the collapsed haplotypes.

    Parameters
    ----------
    write_haplotypes : bool, optional
        If True, a file mapping taxa names to their respective haplotype
        name will be generated (default is True).
    haplotypes_file : string, optional
        Name of the file mapping taxa names to their respective
        haplotype. Only used when `write_haplotypes` is True. If it is
        not provided and `write_haplotypes` is True, the file name will
        be determined from `Alignment.name`.
    dest : string, optional
        Path of directory where `haplotypes_file` will be generated
        (default is ".").
    conversion_suffix : string, optional
        Suffix appended to the haplotypes file. Only used when
        `write_haplotypes` is True and `haplotypes_file` is None.
    haplotype_name : string, optional
        Prefix of the haplotype string. The final haplotype name will be
        `haplotype_name` + <int> (default is "Hap").
    table_in : string, optional
        Name of database table containing the alignment data that is
        used for this operation.
    table_out : string, optional
        Name of database table where the final alignment will be
        inserted (default is "collapsed").
    use_main_table : bool, optional
        If True, both `table_in` and `table_out` are ignored and the
        main table `Alignment.table_name` is used as the input and
        output table (default is False).
    ns : multiprocessing.Manager.Namespace, optional
        A Namespace object used to communicate with the main thread in
        TriFusion.
    pbar : ProgressBar, optional
        A ProgressBar object used to log the progress of TriSeq
        execution.

    See Also
    --------
    SetupDatabase

    Notes
    -----
    In order to collapse an alignment, all sequences would need to be
    compared and somehow stored until the last sequence is processed.
    This creates the immediate issue that, if all sequences are
    different, the entire alignment could have to be stored in memory.
    To increase memory efficiency and performance, sequences are
    converted into hashes, and these hashes are compared among each
    other instead of the actual sequences. This means that instead of
    comparing potentially very long strings and maintaining them in
    memory, only a small hash string is maintained and compared. In
    this way, only a fraction of the information is stored in memory
    and the comparisons are much faster.
    """

    self._set_pipes(ns, pbar, len(self.taxa_list))

    # Create temporary table for storing unique sequences
    self._create_table(".collapsed")

    # Create a new cursor to edit the database while iterating over
    # table_in
    col_cur = self.con.cursor()

    # Dict that will store the hashes of each sequence. It will be used
    # to check if the current sequence has already been processed. The
    # value will be the haplotype name.
    hash_list = {}

    # hap_dic will store the haplotype name as key and a list of the
    # taxa with the same sequence as a list value
    hap_dic = {}
    hap_counter = 0

    for p, (taxa, seq) in enumerate(self.iter_alignment(
            table_name=table_in)):

        self._update_pipes(ns, pbar, p + 1,
                           "Collapsing taxon {}".format(taxa))

        cur_hash = hash(seq)

        # If the current hash is unique, add it to the database
        if cur_hash not in hash_list:

            # Create name for new haplotype and update other haplotype
            # related variables
            haplotype = "{}_{}".format(haplotype_name, hap_counter + 1)
            hash_list[cur_hash] = haplotype
            hap_dic[haplotype] = [taxa]

            # Add data to table
            col_cur.execute("INSERT INTO [.collapsed] "
                            "VALUES (?, ?, ?)",
                            (hap_counter, unicode(haplotype), seq))

            hap_counter += 1

        else:
            hap_dic[hash_list[cur_hash]].append(taxa)

    # The collapse operation is special in the sense that the former
    # taxon names are no longer valid. Therefore, drop the previous
    # table and populate a new one with the collapsed data
    if self._table_exists(table_out):
        self.cur.execute("DROP TABLE [{}];".format(table_out))

    # Replace table_out with the collapsed table
    self.cur.execute("ALTER TABLE [.collapsed] RENAME TO [{}]".format(
        table_out))

    if write_haplotypes is True:
        # If no output file for the haplotype correspondence is
        # provided, use the input alignment name as reference
        if not haplotypes_file:
            haplotypes_file = self.name.split(".")[0] + conversion_suffix
        self.write_loci_correspondence(hap_dic, haplotypes_file, dest)

    self._reset_pipes(ns)
    @staticmethod
    def write_loci_correspondence(hap_dict, output_file, dest="./"):
        """Writes the file mapping taxa to unique haplotypes for `collapse`.

        Parameters
        ----------
        hap_dict : dict
            Dictionary mapping the haplotype (key) to a list of taxa
            (value) sharing the same haplotype.
        output_file : str
            Name of the haplotype correspondence file.
        dest : str, optional
            Path to directory where the `output_file` will be written.
        """

        output_handle = open(join(dest, output_file + ".haplotypes"), "w")

        for haplotype, taxa_list in sorted(hap_dict.items()):
            output_handle.write("%s: %s\n" % (haplotype,
                                              "; ".join(taxa_list)))

        output_handle.close()
@setup_database
[docs] def consensus(self, consensus_type, table_in=None, table_out="consensus", use_main_table=False, ns=None, pbar=None): """Creates a consensus sequence from the alignment. Converts the current `Alignment` alignment data into a single consensus sequence. The `consensus_type` argument determines how variation in the original alignment is handled for the generation of the consensus sequence. The options for handling sequence variation are: - IUPAC: Converts variable sites according to the corresponding IUPAC symbols (DNA sequence type only) - Soft mask: Converts variable sites into missing data - Remove: Removes variable sites - First sequence: Uses the first sequence in the dictionary Parameters ---------- consensus_type : {"IUPAC", "Soft mask", "Remove", "First sequence"} Type of variation handling. See summary above. table_in : string, optional Name of database table containing the alignment data that is used for this operation. table_out : string, optional Name of database table where the final alignment will be inserted (default is "consensus"). use_main_table : bool, optional If True, both `table_in` and `table_out` are ignore and the main table `Alignment.table_name` is used as the input and output table (default is False). ns : multiprocesssing.Manager.Namespace, optional A Namespace object used to communicate with the main thread in TriFusion. pbar : ProgressBar, optional A ProgressBar object used to log the progress of TriSeq execution. 
See Also -------- SetupDatabase """ self._set_pipes(ns, pbar, self.locus_length) # Empty consensus sequence consensus_seq = [] # If sequence type is first sequence if consensus_type == "First sequence": # Grab first available sequence consensus_seq = "" idx = 0 while not consensus_seq: try: consensus_seq = [self.cur.execute( "SELECT seq from [{}] WHERE txId={}".format( table_in, idx)).fetchone()[0]] except TypeError: idx += 1 for p, column in enumerate(self.iter_columns(table_name=table_in)): self._update_pipes(ns, pbar, value=p + 1) column = list(set(column)) # If invariable, include in consensus and skip. Increases speed # when there are no missing or gap characters if len(column) == 1: consensus_seq.append(column[0]) continue # Remove missing data and gaps column = sorted([x for x in column if x != self.sequence_code[1] and x != "-"]) # In case there is only missing/gap characters if not column: consensus_seq.append(self.sequence_code[1]) continue # Check for invariable without missing data if len(column) == 1: consensus_seq.append(column[0]) continue if consensus_type == "IUPAC": # Convert variation into IUPAC code try: consensus_seq.append(iupac["".join(column)]) except KeyError: iupac_code = sorted(set(list(itertools.chain.from_iterable( [x if x in dna_chars else iupac_rev[x] for x in column])))) consensus_seq.append(iupac["".join(iupac_code)]) continue elif consensus_type == "Soft mask": consensus_seq.append(self.sequence_code[1]) continue elif consensus_type == "Remove": continue # The consensus operation is special in the sense that the former taxon # names are no longer valid. 
Therefore, we drop the previous table and # populate a new one with the consensus data if self._table_exists(table_out): self.cur.execute("DELETE FROM [{}];".format(table_out)) # Insert into database self.cur.execute("INSERT INTO [{}] VALUES(?, ?, ?)".format( table_out), (0, "consensus", "".join(consensus_seq))) self.taxa_list = ["consensus"] self.taxa_idx["consensus"] = 0 if ns: if ns.stop: raise KillByUser("") if ns.sa: ns.counter = ns.total = None
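The column-wise consensus logic above can be illustrated independently of the database. This sketch assumes a reduced IUPAC table (the module itself uses complete `iupac`/`iupac_rev` mappings) and a hypothetical `consensus` function over a list of equal-length sequences:

```python
# Subset of the IUPAC ambiguity table, keyed by sorted base combinations
IUPAC = {"AG": "R", "CT": "Y", "AC": "M", "GT": "K",
         "AT": "W", "CG": "S", "ACG": "V", "ACGT": "N"}


def consensus(sequences, mode="IUPAC", missing="n"):
    """Build a consensus sequence, handling variable columns per `mode`."""
    out = []
    for column in zip(*sequences):
        # Unique states, excluding missing data and gaps
        states = sorted(set(column) - {missing, "-"})
        if not states:
            out.append(missing)        # only missing data/gaps
        elif len(states) == 1:
            out.append(states[0])      # invariable site
        elif mode == "IUPAC":
            out.append(IUPAC["".join(states)])
        elif mode == "Soft mask":
            out.append(missing)
        # mode == "Remove": variable sites are simply skipped
    return "".join(out)
```

The "First sequence" option has no column logic to show: it just returns the first available sequence unchanged.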
    def _check_partitions(self, partition_obj):
        """Consistency check of the `partition_obj` for the `Alignment`
        object.

        Performs checks to ensure that a `partition_obj` is consistent
        with the alignment data of the `Alignment` object.

        Parameters
        ----------
        partition_obj : trifusion.process.data.Partitions
            Partitions object.

        See Also
        --------
        trifusion.process.data.Partitions
        """

        # Check if the total length of the partitions matches the length
        # of the current alignment
        if partition_obj.counter != self.locus_length:
            return process.data.InvalidPartitionFile(
                "Partitions in partition file are inconsistent with the "
                "current alignment. Last partition range: {}; Alignment "
                "length: {}".format(partition_obj.counter,
                                    self.locus_length))
    def set_partitions(self, partitions):
        """Updates the `partitions` attribute.

        Parameters
        ----------
        partitions : trifusion.process.data.Partitions
            Partitions object.
        """

        # Check the partitions' consistency
        er = self._check_partitions(partitions)

        if isinstance(er, process.data.InvalidPartitionFile):
            return er
        else:
            self.partitions = partitions
[docs] def reverse_concatenate(self, table_in="", db_con=None, ns=None, pbar=None): """Reverses the alignment according to the `partitions` attribute. This method splits the current alignment according to the partitions set in the `partitions` attribute. Returns an `AlignmentList` object where each partition is an `Alignment` object. Parameters ---------- table_in : string Name of database table containing the alignment data that is used for this operation. db_con : sqlite3.Connection Database connection object that will be provided to each new `Alignment` object. ns : multiprocesssing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. pbar : ProgressBar A ProgressBar object used to log the progress of TriSeq execution. """ def add_alignment(part_name, part_len, taxa_list, taxa_idx): # Create Alignment object current_aln = Alignment(part_name, input_format=self.input_format, sql_cursor=self.cur, locus_length=part_len, sequence_code=self.sequence_code, taxa_list=taxa_list, taxa_idx=taxa_idx) alns.append(current_aln) taxa_list_master = defaultdict(list) taxa_idx_master = defaultdict(dict) self._set_pipes(ns, pbar, total=len(self.taxa_list)) rev_cur = self.con.cursor() for p, taxon in enumerate(self.taxa_list): seq = self.get_sequence(taxon, table_name=table_in) self._update_pipes(ns, pbar, value=p + 1) for name, part_range in self.partitions: name = "".join([x for x in name.split(".")[0] if x.isalnum()]) if part_range[1]: for i in range(3): part_seq = seq[part_range[0][0]: part_range[0][1] + 1][i::3] if part_seq.replace(self.sequence_code[1], "") == "": continue cname = name + str(i) taxa_list_master[cname].append(taxon) taxa_idx_master[cname][taxon] = p if not self._table_exists(cname, cur=rev_cur): self._create_table(cname, cur=rev_cur) rev_cur.execute( "INSERT INTO [{}] VALUES" "(?, ?, ?)".format(cname), (p, taxon, part_seq)) else: part_seq = seq[part_range[0][0]: part_range[0][1] + 1] if 
part_seq.replace(self.sequence_code[1], "") == "": continue taxa_list_master[name].append(taxon) taxa_idx_master[name][taxon] = p if not self._table_exists(name, cur=rev_cur): self._create_table(name, cur=rev_cur) rev_cur.execute( "INSERT INTO [{}] VALUES" "(?, ?, ?)".format(name), (p, taxon, part_seq)) concatenated_aln = AlignmentList([], db_con=db_con, db_cur=self.cur) alns = [] # Create Alignment objects for each partition for name, part_range in self.partitions: name = "".join([x for x in name.split(".")[0] if x.isalnum()]) if part_range[1]: for i in range(3): part_len = (part_range[0][1] - part_range[0][0] + 1) / 3 taxa_list = taxa_list_master[name + str(i)] taxa_idx = taxa_idx_master[name + str(i)] add_alignment(name + str(i), part_len, taxa_list=taxa_list, taxa_idx=taxa_idx) else: part_len = part_range[0][1] - part_range[0][0] + 1 taxa_list = taxa_list_master[name] taxa_idx = taxa_idx_master[name] add_alignment(name, part_len, taxa_list=taxa_list, taxa_idx=taxa_idx) concatenated_aln.add_alignments(alns, ignore_paths=True) return concatenated_aln
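The partition-splitting logic at the core of `reverse_concatenate` can be shown on plain strings. This sketch uses a hypothetical `partitions` mapping of `name -> ((start, end), use_codons)` with 0-based inclusive ranges, mirroring how the method slices each taxon's sequence and, for codon partitions, takes every third character with `[i::3]`:

```python
def reverse_concatenate(seq, partitions):
    """Split a concatenated sequence into per-partition pieces.

    Codon partitions are further split into three sub-partitions,
    one per codon position.
    """
    out = {}
    for name, ((start, end), use_codons) in partitions.items():
        block = seq[start:end + 1]  # ranges are inclusive
        if use_codons:
            for i in range(3):      # one sub-partition per codon position
                out["{}{}".format(name, i)] = block[i::3]
        else:
            out[name] = block
    return out
```

The method proper does the same slicing per taxon, writing each piece into its own database table and wrapping the results in new `Alignment` objects.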
@setup_database
[docs] def filter_codon_positions(self, position_list, table_in=None, table_out="filter", ns=None, use_main_table=False): """ Filter alignment according to codon positions. Filters codon positions from `Alignment` object with DNA sequence type. The `position_list` argument determines which codon positions will be stored. It should be a three element list, corresponding to the three codon position, and the positions that are saved are the ones with a True value. Parameters ---------- position_list : list List of three bool elements that correspond to each codon position. Ex. [True, True, True] will save all positions while [True, True, False] will exclude the third codon position table_in : string, optional Name of database table containing the alignment data that is used for this operation. table_out : string, optional Name of database table where the final alignment will be inserted (default is "filter"). use_main_table : bool, optional If True, both `table_in` and `table_out` are ignore and the main table `Alignment.table_name` is used as the input and output table (default is False). ns : multiprocesssing.Manager.Namespace, optional A Namespace object used to communicate with the main thread in TriFusion. pbar : ProgressBar, optional A ProgressBar object used to log the progress of TriSeq execution. 
See Also -------- SetupDatabase """ def index(length, pos): """ index generator """ for _ in range(0, length, 3): for j in pos: if j: yield 1 else: yield 0 # Create temporary table for storing unique sequences self._create_table(".codonfilter") # Create a new cursor to edit the database while iterating over # table_in cod_cur = self.con.cursor() for taxon, seq in self.iter_alignment(table_name=table_in): if ns: if ns.stop: raise KillByUser("") seq = "".join(list( itertools.compress(seq, index(self.locus_length, position_list)))) cod_cur.execute("INSERT INTO [.codonfilter] VALUES " "(?, ?, ?)", (self.taxa_idx[taxon], taxon, seq)) if self._table_exists(table_out): self.cur.execute("DROP TABLE [{}]".format(table_out)) # Replace table_out with collapsed table self.cur.execute( "ALTER TABLE [.codonfilter] RENAME TO [{}]".format( table_out)) self.locus_length = len(seq)
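The codon-position filter boils down to compressing each sequence with a repeating three-element mask, which is what the inner `index` generator produces. A compact equivalent using `itertools.cycle` (the helper name here is illustrative, not the method's API):

```python
from itertools import compress, cycle


def filter_codon_positions(seq, keep):
    """Keep only the codon positions flagged True in the 3-element
    `keep` list, e.g. [True, True, False] drops every third position."""
    return "".join(compress(seq, cycle(keep)))
```

Applied per taxon, this reproduces the column selection performed inside the loop above.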
@setup_database
[docs] def code_gaps(self, table_out="gaps", table_in=None, use_main_table=False, pbar=None, ns=None): """Code gap patterns as binary matrix at the end of alignment. This method codes gaps present in the alignment in binary format, according to the method of Simmons and Ochoterena (2000), to be read by phylogenetic programs such as MrBayes. The resultant alignment, however, can only be output in the Nexus format. Parameters ---------- table_in : string, optional Name of database table containing the alignment data that is used for this operation. table_out : string, optional Name of database table where the final alignment will be inserted (default is "gaps"). use_main_table : bool, optional If True, both `table_in` and `table_out` are ignore and the main table `Alignment.table_name` is used as the input and output table (default is False). ns : multiprocesssing.Manager.Namespace, optional A Namespace object used to communicate with the main thread in TriFusion. pbar : ProgressBar, optional A ProgressBar object used to log the progress of TriSeq execution. See Also -------- SetupDatabase """ self._set_pipes(ns, pbar, total=len(self.taxa_list)) def gap_listing(sequence, gap_symbol="-"): """ Function that parses a sequence string and returns the position of indel events. 
The returned list is composed of tuples with the span of each indel """ gap = "%s+" % gap_symbol span_regex = "" gap_list, seq_start = [], 0 while span_regex is not None: span_regex = re.search(gap, sequence) if span_regex is not None and seq_start == 0: gap_list.append(span_regex.span()) sequence = sequence[span_regex.span()[1] + 1:] seq_start = span_regex.span()[1] + 1 elif span_regex is not None and seq_start != 0: gap_list.append((span_regex.span()[0] + seq_start, span_regex.span()[1] + seq_start)) sequence = sequence[span_regex.span()[1] + 1:] seq_start += span_regex.span()[1] + 1 return gap_list def gap_binary_generator(sequence, gap_list): """ This function contains the algorithm to construct the binary state block for the indel events """ for cur_gap in gap_list: cur_gap_start, cur_gap_end = cur_gap if sequence[cur_gap_start:cur_gap_end] == "-" * \ (cur_gap_end - cur_gap_start) and \ sequence[cur_gap_start - 1] != "-" and \ sequence[cur_gap_end] != "-": sequence += "1" elif sequence[cur_gap_start:cur_gap_end] == "-" * \ (cur_gap_end - cur_gap_start): if sequence[cur_gap_start - 1] == "-" or \ sequence[cur_gap_end] == "-": sequence += "-" elif sequence[cur_gap_start:cur_gap_end] != "-" * \ (cur_gap_end - cur_gap_start): sequence += "0" return sequence complete_gap_list = [] sequence_data = [] # Get the complete list of unique gap positions in the alignment for p, (tx, seq) in enumerate( self.iter_alignment(table_name=table_in)): self._update_pipes(ns, pbar, value=p + 1, msg="Fetching gaps for {}".format(tx)) current_list = gap_listing(seq) complete_gap_list += [gap for gap in current_list if gap not in complete_gap_list] self._set_pipes(ns, pbar, total=len(self.taxa_list)) # This will add the binary matrix of the unique gaps listed at the # end of each alignment sequence for p, (tx, seq) in enumerate( self.iter_alignment(table_name=table_in)): self._update_pipes(ns, pbar, value=p + 1, msg="Adding gaps for {}".format(tx)) final_seq = gap_binary_generator(seq, 
complete_gap_list) # Check if input and output tables are the same. If they are, # it means that the output table already exists and is being # updated if table_in == table_out: sequence_data.append((final_seq, self.taxa_idx[tx])) else: sequence_data.append((self.taxa_idx[tx], tx, final_seq)) # Populate/modify table if table_in == table_out: self.cur.executemany("UPDATE [{}] SET seq=? WHERE txId=?".format( table_out), sequence_data) else: self.cur.executemany("INSERT INTO [{}] VALUES (?, ?, ?)".format( table_out), sequence_data) self.restriction_range = "%s-%s" % (int(self.locus_length), len(complete_gap_list) + int(self.locus_length) - 1) self._reset_pipes(ns)
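The two helpers above (listing gap spans, then coding each unique span as a binary character) follow Simmons and Ochoterena (2000). A self-contained sketch of the same coding over a `{taxon: sequence}` dict, using `re.finditer` for the span listing (names hypothetical; the coding rules match the ones in `gap_binary_generator`):

```python
import re


def gap_spans(seq, gap="-"):
    """(start, end) spans of each indel run; end is exclusive."""
    return [m.span() for m in re.finditer(gap + "+", seq)]


def code_gaps(sequences):
    """Append a binary block coding shared indel events.

    Per unique gap span: '1' if the sequence has exactly this indel,
    '-' if the region is gapped but part of a larger indel (the
    character is inapplicable), '0' otherwise.
    """
    # Collect the unique gap spans across the whole alignment
    spans = []
    for seq in sequences.values():
        for span in gap_spans(seq):
            if span not in spans:
                spans.append(span)

    coded = {}
    for taxon, seq in sequences.items():
        block = []
        for start, end in spans:
            if (start, end) in gap_spans(seq):
                block.append("1")   # exactly this indel event
            elif set(seq[start:end]) == {"-"}:
                block.append("-")   # subsumed by a larger gap
            else:
                block.append("0")   # no matching gap here
        coded[taxon] = seq + "".join(block)
    return coded
```

As the docstring notes, only the Nexus writer can represent this appended binary block (via a mixed DNA/restriction data type).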
    def _filter_terminals(self, table_in, table_out, ns=None):
        """Replaces gaps at the alignment's extremities with missing data.

        This method should not be called directly. Instead, it is used by
        the `filter_missing_data` method.

        Parameters
        ----------
        table_in : string
            Name of database table containing the alignment data that is
            used for this operation.
        table_out : string
            Name of database table where the final alignment will be
            inserted (default is "collapsed").
        ns : multiprocessing.Manager.Namespace, optional
            A Namespace object used to communicate with the main thread
            in TriFusion.

        See Also
        --------
        filter_missing_data
        """

        self._set_pipes(ns, None, total=len(self.taxa_list))

        self._create_table(".filterterminals")

        filt_cur = self.con.cursor()

        for p, (tx, seq) in enumerate(
                self.iter_alignment(table_name=table_in)):

            self._update_pipes(ns, None, value=p + 1,
                               msg="Filtering terminals")

            # Condition where the sequence only has gaps
            if not seq.strip("-"):
                seq = self.sequence_code[1] * len(seq)
                filt_cur.execute("INSERT INTO [.filterterminals] "
                                 "VALUES (?, ?, ?)",
                                 (self.taxa_idx[tx], tx, seq))
                continue

            seq = list(seq)
            counter, reverse_counter = 0, -1

            while seq[counter] == "-":
                seq[counter] = self.sequence_code[1]
                counter += 1

            while seq[reverse_counter] == "-":
                seq[reverse_counter] = self.sequence_code[1]
                reverse_counter -= 1

            filt_cur.execute("INSERT INTO [.filterterminals] "
                             "VALUES (?, ?, ?)",
                             (self.taxa_idx[tx], tx, "".join(seq)))

        # If the output table already exists, drop it and replace it
        # with the newly filtered table
        if self._table_exists(table_out):
            self.cur.execute("DROP TABLE [{}]".format(table_out))

        self.cur.execute("ALTER TABLE [.filterterminals] "
                         "RENAME TO [{}]".format(table_out))
[docs] def _filter_columns(self, gap_threshold, missing_threshold, table_in, table_out, ns=None): """Filters alignment columns based on missing data content. Filters alignment columns based on the amount of missing data and gap content. If the calculated content is higher than a defined threshold, the column is removed from the final alignment. Setting `gap_threshold` and `missing_threshold` to 0 results in an alignment with no missing data, while setting them to 100 will not remove any column. This method should not be called directly. Instead it is used by the `filter_missing_data` method. Parameters ---------- gap_threshold : int Integer between 0 and 100 defining the percentage above which a column with that gap percentage is removed. missing_threshold : int Integer between 0 and 100 defining the percentage above which a column with that gap+missing percentage is removed. table_in : string Name of database table containing the alignment data that is used for this operation. table_out : string Name of database table where the final alignment will be inserted (default is "collapsed"). ns : multiprocessing.Manager.Namespace, optional A Namespace object used to communicate with the main thread in TriFusion. 
See Also -------- filter_missing_data """ self._set_pipes(ns, None, total=self.locus_length) taxa_number = float(len(self.taxa_list)) filtered_cols = [] self._create_table(".filtercolumns") filt_cur = self.con.cursor() # Creating the column list variable for p, column in enumerate(self.iter_columns(table_name=table_in)): self._update_pipes(ns, None, value=p + 1, msg="Filtering columns") cadd = column.count # Calculating metrics gap_proportion = (float(cadd("-")) / taxa_number) * float(100) missing_proportion = (float(cadd(self.sequence_code[1])) / taxa_number) * float(100) total_missing_proportion = gap_proportion + missing_proportion if gap_proportion <= gap_threshold and \ total_missing_proportion <= missing_threshold: filtered_cols.append(1) else: filtered_cols.append(0) for tx, seq in self.iter_alignment(table_name=table_in): if ns: if ns.stop: raise KillByUser("") seq = "".join(compress(seq, filtered_cols)) filt_cur.execute("INSERT INTO [.filtercolumns] " "VALUES (?, ?, ?)", (self.taxa_idx[tx], tx, seq)) # Check if input and output tables are the same. If they are, # it means that the output table already exists and is being # updated if self._table_exists(table_out): self.cur.execute("DROP TABLE [{}];".format(table_out)) self.cur.execute("ALTER TABLE [.filtercolumns] RENAME TO [{}]".format( table_out)) self.locus_length = len(seq)
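The column filter above computes, per alignment column, the gap percentage and the combined gap+missing percentage, then drops columns exceeding either threshold. The same two-pass idea (build a keep-mask over columns, then compress every sequence with it) on plain strings, with hypothetical names:

```python
from itertools import compress


def filter_columns(sequences, gap_threshold, missing_threshold,
                   missing="n"):
    """Drop columns whose gap %% exceeds `gap_threshold` or whose
    gap+missing %% exceeds `missing_threshold` (both 0-100)."""
    n = len(sequences)

    # Pass 1: decide, per column, whether it survives the thresholds
    keep = []
    for column in zip(*sequences.values()):
        gap_pct = column.count("-") / n * 100
        total_pct = gap_pct + column.count(missing) / n * 100
        keep.append(gap_pct <= gap_threshold and
                    total_pct <= missing_threshold)

    # Pass 2: compress every sequence with the shared keep-mask
    return {tx: "".join(compress(seq, keep))
            for tx, seq in sequences.items()}
```

Setting both thresholds to 0 yields an alignment with no missing data at all; setting both to 100 keeps every column, matching the behavior documented above.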
@setup_database
[docs] def filter_missing_data(self, gap_threshold, missing_threshold, table_in=None, table_out="filter", ns=None, use_main_table=False): """Filters alignment according to missing data. Filters gaps and true missing data from the alignment using tolerance thresholds for each type of missing data. Both thresholds are maximum percentages of sites in an alignment column containing the type of missing data. If gap_threshold=50, for example, alignment columns with more than 50% of sites with gaps are removed. Parameters ---------- gap_threshold : int Integer between 0 and 100 defining the percentage above which a column with that gap percentage is removed. missing_threshold : int Integer between 0 and 100 defining the percentage above which a column with that gap+missing percentage is removed. table_in : string, optional Name of database table containing the alignment data that is used for this operation. table_out : string, optional Name of database table where the final alignment will be inserted (default is "filter"). use_main_table : bool, optional If True, both `table_in` and `table_out` are ignore and the main table `Alignment.table_name` is used as the input and output table (default is False). ns : multiprocesssing.Manager.Namespace, optional A Namespace object used to communicate with the main thread in TriFusion. """ self._filter_terminals(table_in=table_in, table_out=table_out, ns=ns) # Update output table. This will make the _filter_columns method modify # the alignment produced in the _filter_terminals method. table_in = table_out self._filter_columns(gap_threshold, missing_threshold, table_in=table_in, table_out=table_out, ns=ns)
    @staticmethod
    def _test_range(s, min_val, max_val):
        """Tests whether a given integer `s` is inside a specified range.

        Wrapper for the tests that determine whether a certain value
        (`s`) is within the range provided by `min_val` and `max_val`.
        Both of these arguments can be None, in which case there is no
        lower and/or upper boundary, respectively.

        Parameters
        ----------
        s : int
            Test value.
        min_val : int
            Minimum range allowed for the test value.
        max_val : int
            Maximum range allowed for the test value.

        Returns
        -------
        _ : bool
            True if `s` is within the boundaries. False if not.
        """

        # If both values were specified, check if s is within range
        if max_val is not None and min_val is not None \
                and min_val <= s <= max_val:
            return True
        # If only min_val was specified, check if s is higher
        elif max_val is None and s >= min_val:
            return True
        # If only max_val was specified, check if s is lower
        elif min_val is None and s <= max_val:
            return True
        else:
            return False
[docs] def filter_segregating_sites(self, min_val, max_val, table_in=None, ns=None, pbar=None): """Checks if the segregating sites from alignment are within range. Evaluates the number of segregating sites of the current alignment and returns True if it falls between the range specified by the `min_val` and `max_val` arguments. Parameters ---------- min_val : int Minimum number of segregating sites for the alignment to pass. Can be None, in which case there is no lower bound. max_val : int Maximum number of segregating sites for the alignment to pass. Can be None, in which case there is no upper bound. table_in : string, optional Name of database table containing the alignment data that is used for this operation. ns : multiprocesssing.Manager.Namespace, optional A Namespace object used to communicate with the main thread in TriFusion. pbar : ProgressBar, optional A ProgressBar object used to log the progress of TriSeq execution. Returns ------- _ : bool True if the number of segregating sites are within the specified range. False if not. See Also -------- filter_informative_sites """ if pbar: pbar.max_value = self.locus_length # Counter for segregating sites s = 0 # Creating the column list variable for p, column in enumerate(self.iter_columns_uniq( table_name=table_in)): if ns: if ns.stop: raise KillByUser("") if pbar: pbar.update(p + 1) v = len([i for i in column if i not in [self.sequence_code[1], "-"]]) if v == 1: continue if v > 1: s += 1 # Add these tests so that the method may exit earlier if the # conditions are met, precluding the analysis of the entire # alignment if min_val and s >= min_val and max_val is None: return True if max_val and s > max_val: return False return self._test_range(s, min_val, max_val)
[docs] def filter_informative_sites(self, min_val, max_val, table_in=None, ns=None, pbar=None): """Checks if the variable sites from alignment are within range. Similar to `filter_segregating_sites` method, but only considers informative sites (variable sites present in more than 2 taxa). Parameters ---------- min_val : int Minimum number of informative sites for the alignment to pass. Can be None, in which case there is no lower bound. max_val : int Maximum number of informative sites for the alignment to pass. Can be None, in which case there is no upper bound. table_in : string, optional Name of database table containing the alignment data that is used for this operation. ns : multiprocesssing.Manager.Namespace, optional A Namespace object used to communicate with the main thread in TriFusion. pbar : ProgressBar, optional A ProgressBar object used to log the progress of TriSeq execution. Returns ------- _ : bool True if the number of informative sites are within the specified range. False if not. See Also -------- filter_segregating_sites """ if pbar: pbar.max_value = self.locus_length # Counter for informative sites s = 0 # Creating the column list variable for p, column in enumerate(self.iter_columns(table_name=table_in)): if pbar: pbar.update(p + 1) if ns: if ns.stop: raise KillByUser("") column = Counter([i for i in column if i not in [self.sequence_code[1], "-"]]) if column: # Skip if column has no variation if len(column) == 1: continue # Delete most common del column[column.most_common()[0][0]] # If any of the remaining sites is present in more than two # taxa score the site as informative if any([x >= 2 for x in column.values()]): s += 1 # Add these tests so that the method may exit earlier if the # conditions are met, precluding the analysis of the entire # alignment if min_val and s >= min_val and max_val is None: return True if max_val and s > max_val: return False return self._test_range(s, min_val, max_val)
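The informative-site test above uses a `Counter` per column: after discarding gaps and missing data and removing the most common state, a column is parsimony-informative if any remaining state still occurs in at least two sequences. A standalone counter implementing that rule (function name hypothetical):

```python
from collections import Counter


def count_informative_sites(sequences, missing="n"):
    """Count parsimony-informative columns across equal-length
    sequences: at least two states, each in at least two sequences."""
    s = 0
    for column in zip(*sequences):
        counts = Counter(x for x in column if x not in (missing, "-"))
        if len(counts) < 2:
            continue  # invariable (or all-missing) column
        # Drop the most common state; the site is informative if any
        # remaining state occurs in at least two sequences
        del counts[counts.most_common(1)[0][0]]
        if any(v >= 2 for v in counts.values()):
            s += 1
    return s
```

The method itself additionally short-circuits: it returns as soon as the running count already satisfies `min_val` (with no upper bound) or exceeds `max_val`, avoiding a full pass over large alignments.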
[docs] def _get_interleave_data(self, table_name=None, table_suffix=None, ns_pipe=None, pbar=None): """Builds an interleave matrix on the database. When the user requests an interleave output format, this method rearranges the data matrix (which is leave in the database) in interleave format in a new table. This should be performed only once, even when multiple formats are specified in interleave. The first one calls this class, and the subsequent ones use the this table. Parameters ---------- ns_pipe : multiprocesssing.Manager.Namespace, optional A Namespace object used to communicate with the main thread in TriFusion. pbar : ProgressBar, optional A ProgressBar object used to log the progress of TriSeq execution. table_suffix : string, optional Suffix of the table from where the sequence data is fetched. The suffix is append to `Alignment.table_name`. table_name : string, optional Name of the table from where the sequence data is fetched. """ if self.cur.execute( "SELECT name FROM sqlite_master WHERE type='table' AND" " name='.interleavedata'").fetchall(): self.cur.execute("DELETE FROM [.interleavedata]") else: self.cur.execute("CREATE TABLE [.interleavedata](" "taxon TEXT," "slice INT," "seq TEXT)") self.cur.execute("CREATE INDEX interindex ON " "[.interleavedata](slice)") temp_cur = self.con.cursor() for c, (tx, seq) in enumerate(self.iter_alignment( table_name=table_name, table_suffix=table_suffix)): self._update_pipes(ns_pipe, pbar, value=c + 1) counter = 0 for i in xrange(90, self.locus_length, 90): temp_cur.execute("INSERT INTO [.interleavedata] VALUES" " (?, ?, ?)", (tx, i, seq[counter:i])) # temp_storage[p].append(seq[counter:i]) counter = i try: if self.locus_length % 90: i += 1 temp_cur.execute("INSERT INTO [.interleavedata] VALUES" " (?, ?, ?)", (tx, i, seq[counter:])) # temp_storage[p].append(seq[c:],) # This likely means that the alignment is less than # 90 characters long. 
In this case, append to the storage's # 0 index except UnboundLocalError: temp_cur.execute("INSERT INTO [.interleavedata] VALUES " "(?, ?, ?)", (tx, 0, seq)) return True
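The interleave table built above stores one row per (taxon, 90-column slice), indexed by slice so writers can emit all taxa for slice 1, then all taxa for slice 2, and so on. The reordering itself is simple to show in memory; this generator is an illustrative stand-in for the database table, not the method's API:

```python
def interleave_blocks(taxa_seqs, width=90):
    """Yield (taxon, block) pairs grouped by `width`-column slices,
    the row order used for interleaved output."""
    length = max(len(s) for s in taxa_seqs.values())
    for start in range(0, length, width):
        for taxon, seq in taxa_seqs.items():
            yield taxon, seq[start:start + width]
```

Because the table is keyed by slice, it only needs to be built once even when several interleaved output formats are requested in the same run.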
[docs] def _write_fasta(self, output_file, **kwargs): """Writes `Alignment` object into fasta output format. Parameters ---------- output_file : str Name of the output file. If using `ns_pipe` in TriFusion, it will prompt the user if the output file already exists. interleave : bool, optional Determines whether the output alignment will be in leave (False) or interleave (True) format. Not all output formats support this option. ld_hat : bool, optional If True, the fasta output format will include a first line compliant with the format of LD Hat and will truncate sequence names and sequence length per line accordingly. ns_pipe : multiprocesssing.Manager.Namespace, optional A Namespace object used to communicate with the main thread in TriFusion. pbar : ProgressBar, optional A ProgressBar object used to log the progress of TriSeq execution. table_suffix : string, optional Suffix of the table from where the sequence data is fetched. The suffix is append to `Alignment.table_name`. table_name : string, optional Name of the table from where the sequence data is fetched. 
""" # Get relevant keyword arguments ld_hat = kwargs.get("ld_hat", False) interleave = kwargs.get("interleave", False) table_suffix = kwargs.get("table_suffix", None) table_name = kwargs.get("table_name", None) ns_pipe = kwargs.get("ns_pipe", None) pbar = kwargs.get("pbar", None) out_file = open("{}.fas".format(output_file), "w") # If LD HAT sub format has been specificed, write the first line # containing the number of sequences, sites and genotype phase if ld_hat: out_file.write("{} {} {}\n".format( len(self.taxa_list), self.locus_length, "2")) self._set_pipes(ns_pipe, pbar, total=len(self.taxa_list), msg="Writing fasta output") for p, (taxon, seq) in enumerate(self.iter_alignment( table_name=table_name, table_suffix=table_suffix)): self._update_pipes(ns_pipe, pbar, value=p + 1) if ld_hat: # Truncate sequence name to 30 characters out_file.write(">%s\n" % (taxon[:30])) # Limit each sequence line to 2000 characters if len(seq) > 2000: for i in range(0, len(seq), 2000): out_file.write("%s\n" % (seq[i:i + 2000].upper())) elif interleave: out_file.write(">%s\n" % taxon) counter = 0 for i in range(90, self.locus_length, 90): out_file.write("%s\n" % seq[counter:i]) counter = i else: out_file.write("%s\n" % seq[counter:]) else: out_file.write(">%s\n%s\n" % (taxon, seq.upper())) out_file.close()
    def _write_phylip(self, output_file, **kwargs):
        """Writes `Alignment` object into phylip format.

        Parameters
        ----------
        output_file : str
            Name of the output file. If using `ns_pipe` in TriFusion,
            it will prompt the user if the output file already exists.
        tx_space_phy : int, optional
            Space (in characters) provided for the taxon name in
            phylip format (default is 40).
        cut_space_phy : int, optional
            Set the maximum allowed character length for taxon names
            in phylip format. Longer names are truncated
            (default is 39).
        interleave : bool, optional
            Determines whether the output alignment will be in leave
            (False) or interleave (True) format. Not all output
            formats support this option.
        phy_truncate_names : bool, optional
            If True, taxon names in phylip format are truncated to a
            maximum of 10 characters (default is False).
        partition_file : bool, optional
            If True, the auxiliary partitions file will be written
            (default is True).
        model_phylip : str, optional
            Substitution model for the auxiliary partition file of
            phylip format, compliant with RAxML.
        ns_pipe : multiprocessing.Manager.Namespace, optional
            A Namespace object used to communicate with the main
            thread in TriFusion.
        pbar : ProgressBar, optional
            A ProgressBar object used to log the progress of TriSeq
            execution.
        table_suffix : string, optional
            Suffix of the table from where the sequence data is
            fetched. The suffix is appended to `Alignment.table_name`.
        table_name : string, optional
            Name of the table from where the sequence data is fetched.
        """

        # Get relevant keyword arguments
        interleave = kwargs.get("interleave", False)
        tx_space_phy = kwargs.get("tx_space_phy", 40)
        cut_space_phy = kwargs.get("cut_space_phy", 39)
        phy_truncate_names = kwargs.get("phy_truncate_names", False)
        partition_file = kwargs.get("partition_file", None)
        model_phylip = kwargs.get("model_phylip", None)
        table_suffix = kwargs.get("table_suffix", None)
        table_name = kwargs.get("table_name", None)
        ns_pipe = kwargs.get("ns_pipe", None)
        pbar = kwargs.get("pbar", None)

        # Change taxa space if phy_truncate_names option is set to True
        if phy_truncate_names:
            cut_space_phy = 10

        out_file = open("{}.phy".format(output_file), "w")
        out_file.write("{} {}\n".format(len(self.taxa_list),
                                        self.locus_length))

        if interleave:

            self._set_pipes(ns_pipe, pbar, total=len(self.taxa_list),
                            msg="Writing phylip output")

            if not self.interleave_data:
                self.interleave_data = self._get_interleave_data(
                    table_name=table_name,
                    table_suffix=table_suffix,
                    ns_pipe=ns_pipe,
                    pbar=pbar)

            write_tx = True
            prev = 90
            for tx, p, seq in self.cur.execute(
                    "SELECT taxon, slice, seq FROM [.interleavedata] "
                    "ORDER BY slice"):
                if p != prev:
                    out_file.write("\n")
                    prev = p
                    write_tx = False
                if write_tx:
                    out_file.write("{} {}\n".format(
                        tx[:cut_space_phy].ljust(tx_space_phy),
                        seq.upper()))
                else:
                    out_file.write(seq.upper() + "\n")

        else:
            self._set_pipes(ns_pipe, pbar, total=len(self.taxa_list),
                            msg="Writing phylip output")

            for p, (taxon, seq) in enumerate(self.iter_alignment(
                    table_name=table_name, table_suffix=table_suffix)):
                self._update_pipes(ns_pipe, pbar, value=p + 1)
                out_file.write("%s %s\n" % (
                    taxon[:cut_space_phy].ljust(tx_space_phy),
                    seq.upper()))

        # In case there is a concatenated alignment being written
        if not self.partitions.is_single() and partition_file:
            partition_file = open(output_file + "_part.File", "w")

            for name, lrange in self.partitions:
                # Get model from app if it exists and there are no
                # codon positions
                model = model_phylip if \
                    self.partitions.models[name] == [None] or \
                    len(self.partitions.models[name][1]) > 1 \
                    else self.partitions.models[name][1][0]
                partition_file.write("%s, %s = %s\n" % (
                    model if model else self.sequence_code[0],
                    name,
                    "-".join([str(x + 1) for x in lrange[0]])))
            partition_file.close()

        out_file.close()
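The fixed-width taxon column used above (truncate to `cut_space_phy`, then pad to `tx_space_phy`) is easy to get wrong off by one. A minimal standalone sketch of that formatting, with a hypothetical helper name (`phylip_lines` is not part of TriFusion):

```python
def phylip_lines(taxa_seqs, tx_space=40, cut_space=39):
    """Format (taxon, sequence) pairs as sequential phylip lines.

    Taxon names longer than `cut_space` are truncated, then padded
    with spaces to a fixed width of `tx_space` characters. The first
    line carries the taxa count and alignment length.
    """
    n_taxa = len(taxa_seqs)
    locus_length = len(taxa_seqs[0][1]) if taxa_seqs else 0
    lines = ["{} {}".format(n_taxa, locus_length)]
    for taxon, seq in taxa_seqs:
        lines.append("{} {}".format(
            taxon[:cut_space].ljust(tx_space), seq.upper()))
    return lines
```

Note that `cut_space` is one less than `tx_space` by default, so even a maximally long name keeps at least one separating space before the sequence.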
    def _write_nexus(self, output_file, **kwargs):
        """Writes `Alignment` object into nexus file format.

        Parameters
        ----------
        output_file : str
            Name of the output file.
        tx_space_nex : int, optional
            Space (in characters) provided for the taxon name in nexus
            format (default is 40).
        cut_space_nex : int, optional
            Set the maximum allowed character length for taxon names
            in nexus format. Longer names are truncated
            (default is 39).
        interleave : bool, optional
            Determines whether the output alignment will be in leave
            (False) or interleave (True) format. Not all output
            formats support this option.
        gap : str, optional
            Symbol for alignment gaps (default is '-').
        use_charset : bool, optional
            If True, partitions from the `Partitions` object will be
            written in the nexus output format (default is True).
        use_nexus_models : bool, optional
            If True, writes the partitions charset block in nexus
            format.
        outgroup_list : list, optional
            Specify list of taxon names that will be defined as the
            outgroup in the nexus output format. This may be useful
            for analyses with MrBayes or other software that may
            require outgroups.
        ns_pipe : multiprocessing.Manager.Namespace, optional
            A Namespace object used to communicate with the main
            thread in TriFusion.
        pbar : ProgressBar, optional
            A ProgressBar object used to log the progress of TriSeq
            execution.
        table_suffix : string, optional
            Suffix of the table from where the sequence data is
            fetched. The suffix is appended to `Alignment.table_name`.
        table_name : string, optional
            Name of the table from where the sequence data is fetched.
        """

        # Get relevant keyword arguments
        interleave = kwargs.get("interleave", False)
        tx_space_nex = kwargs.get("tx_space_nex", 40)
        cut_space_nex = kwargs.get("cut_space_nex", 39)
        gap = kwargs.get("gap", "-")
        use_charset = kwargs.get("use_charset", True)
        use_nexus_models = kwargs.get("use_nexus_models", True)
        outgroup_list = kwargs.get("outgroup_list", None)
        table_suffix = kwargs.get("table_suffix", None)
        table_name = kwargs.get("table_name", None)
        ns_pipe = kwargs.get("ns_pipe", None)
        pbar = kwargs.get("pbar", None)

        out_file = open("{}.nex".format(output_file), "w")

        # This writes the output in interleave format
        if interleave:

            self._set_pipes(ns_pipe, pbar, total=len(self.taxa_list),
                            msg="Writing nexus output")

            if self.restriction_range is not None:
                out_file.write("#NEXUS\n\nBegin data;\n\tdimensions "
                               "ntax=%s nchar=%s ;\n\tformat datatype="
                               "mixed(%s:1-%s, restriction:%s) interleave="
                               "yes gap=%s missing=%s ;\n\tmatrix\n" %
                               (len(self.taxa_list),
                                self.locus_length,
                                self.sequence_code[0].upper(),
                                self.locus_length - 1,
                                self.restriction_range,
                                gap,
                                self.sequence_code[1].upper()))
            else:
                out_file.write("#NEXUS\n\nBegin data;\n\tdimensions "
                               "ntax=%s nchar=%s ;\n\tformat datatype=%s "
                               "interleave=yes gap=%s missing=%s ;\n\t"
                               "matrix\n" %
                               (len(self.taxa_list),
                                self.locus_length,
                                self.sequence_code[0].upper(),
                                gap,
                                self.sequence_code[1].upper()))

            if not self.interleave_data:
                self.interleave_data = self._get_interleave_data(
                    table_suffix=table_suffix,
                    table_name=table_name,
                    ns_pipe=ns_pipe,
                    pbar=pbar)

            # Iterate over interleave data
            prev = 0
            for tx, p, seq in self.cur.execute(
                    "SELECT taxon, slice, seq FROM [.interleavedata] "
                    "ORDER BY slice"):
                if p != prev:
                    out_file.write("\n")
                    prev = p
                out_file.write("{} {}\n".format(
                    tx[:cut_space_nex].ljust(tx_space_nex),
                    seq.upper()))

            out_file.write(";\n\tend;")

        # This writes the output in leave format (default)
        else:
            self._set_pipes(ns_pipe, pbar, total=len(self.taxa_list),
                            msg="Writing nexus output")

            if self.restriction_range is not None:
                out_file.write("#NEXUS\n\nBegin data;\n\tdimensions "
                               "ntax=%s nchar=%s ;\n\tformat datatype=mixed"
                               "(%s:1-%s, restriction:%s) interleave=no "
                               "gap=%s missing=%s ;\n\tmatrix\n" %
                               (len(self.taxa_list),
                                self.locus_length,
                                self.sequence_code[0],
                                self.locus_length - 1,
                                self.restriction_range,
                                gap,
                                self.sequence_code[1].upper()))
            else:
                out_file.write("#NEXUS\n\nBegin data;\n\tdimensions ntax=%s"
                               " nchar=%s ;\n\tformat datatype=%s "
                               "interleave=no gap=%s missing=%s ;\n\t"
                               "matrix\n" %
                               (len(self.taxa_list),
                                self.locus_length,
                                self.sequence_code[0],
                                gap,
                                self.sequence_code[1].upper()))

            for p, (taxon, seq) in enumerate(self.iter_alignment(
                    table_name=table_name, table_suffix=table_suffix)):
                self._update_pipes(ns_pipe, pbar, value=p + 1)
                out_file.write("%s %s\n" % (taxon[:cut_space_nex].ljust(
                    tx_space_nex), seq))

            out_file.write(";\n\tend;")

        if use_charset:
            # Writing partitions, if any
            if not self.partitions.is_single():
                out_file.write("\nbegin mrbayes;\n")
                p = 0

                # Full gene partitions
                for name, lrange in self.partitions:
                    # If there are codon partitions, write those
                    if lrange[1]:
                        for i in lrange[1]:
                            out_file.write(
                                "\tcharset %s_%s = %s-%s\\3;\n" %
                                (name, i + 1, i + 1, lrange[0][1] + 1))
                            p += 1
                    else:
                        out_file.write(
                            "\tcharset %s = %s-%s;\n" %
                            (name, lrange[0][0] + 1, lrange[0][1] + 1))
                        p += 1

                out_file.write("\tpartition part = %s: %s;\n\tset "
                               "partition=part;\nend;\n" %
                               (p, ", ".join([name for name in
                                self.partitions.get_partition_names()])))

            # Write models, if any
            if use_nexus_models:
                out_file.write("\nbegin mrbayes;\n")
                i = 1
                for name, models in self.partitions.models.items():
                    if models[1]:
                        for m in models[1]:
                            if m:
                                m_str = self.partitions._models["mrbayes"][m]
                                if not self.partitions.is_single():
                                    out_file.write(
                                        "applyto=({}) lset ".format(i) +
                                        " ".join(m_str) + "\n")
                                else:
                                    out_file.write(
                                        "lset " + " ".join(m_str) + "\n")
                    i += 1
                out_file.write("end;\n")

        # In case outgroup taxa are specified
        if outgroup_list is not None:
            # This assures that only the outgroups present in the
            # current file are written
            compliant_outgroups = [taxon for taxon in outgroup_list
                                   if taxon in self.taxa_list]
            if compliant_outgroups:
                out_file.write("\nbegin mrbayes;\n\toutgroup %s\nend;\n"
                               % (" ".join(compliant_outgroups)))

        out_file.close()
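The charset bookkeeping above converts 0-based partition ranges into the 1-based, inclusive ranges that nexus expects. A minimal standalone sketch of that conversion (the helper name `nexus_charsets` is illustrative, not TriFusion API):

```python
def nexus_charsets(partitions):
    """Render a MrBayes charset block from 0-based partition ranges.

    `partitions` maps a partition name to a (start, end) tuple of
    0-based, inclusive column indices; nexus charsets are 1-based,
    hence the `+ 1` on both bounds.
    """
    lines = ["begin mrbayes;"]
    for name, (start, end) in partitions.items():
        lines.append("\tcharset {} = {}-{};".format(
            name, start + 1, end + 1))
    lines.append("\tpartition part = {}: {};".format(
        len(partitions), ", ".join(partitions)))
    lines.append("\tset partition=part;")
    lines.append("end;")
    return "\n".join(lines)
```

Codon partitions add a `\3` step suffix (e.g. `charset g1_1 = 1-300\3;`) so that every third column belongs to the charset, following the same 1-based convention.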
    def _write_stockholm(self, output_file, **kwargs):
        """Writes `Alignment` object into stockholm file format.

        Parameters
        ----------
        output_file : str
            Name of the output file.
        ns_pipe : multiprocessing.Manager.Namespace, optional
            A Namespace object used to communicate with the main
            thread in TriFusion.
        pbar : ProgressBar, optional
            A ProgressBar object used to log the progress of TriSeq
            execution.
        table_suffix : string, optional
            Suffix of the table from where the sequence data is
            fetched. The suffix is appended to `Alignment.table_name`.
        table_name : string, optional
            Name of the table from where the sequence data is fetched.
        """

        table_suffix = kwargs.get("table_suffix", None)
        table_name = kwargs.get("table_name", None)
        ns_pipe = kwargs.get("ns_pipe", None)
        pbar = kwargs.get("pbar", None)

        self._set_pipes(ns_pipe, pbar, total=len(self.taxa_list),
                        msg="Writing stockholm output")

        out_file = open("{}.stockholm".format(output_file), "w")

        out_file.write("# STOCKHOLM V1.0\n")

        for p, (taxon, seq) in enumerate(self.iter_alignment(
                table_name=table_name, table_suffix=table_suffix)):
            self._update_pipes(ns_pipe, pbar, value=p + 1)
            out_file.write("{}\t{}\n".format(taxon, seq))

        out_file.write("//\n")

        out_file.close()
    def _write_gphocs(self, output_file, **kwargs):
        """Writes `Alignment` object into gphocs file format.

        Parameters
        ----------
        output_file : str
            Name of the output file.
        ns_pipe : multiprocessing.Manager.Namespace, optional
            A Namespace object used to communicate with the main
            thread in TriFusion.
        pbar : ProgressBar, optional
            A ProgressBar object used to log the progress of TriSeq
            execution.
        table_suffix : string, optional
            Suffix of the table from where the sequence data is
            fetched. The suffix is appended to `Alignment.table_name`.
        table_name : string, optional
            Name of the table from where the sequence data is fetched.
        """

        table_suffix = kwargs.get("table_suffix", None)
        table_name = kwargs.get("table_name", None)
        ns_pipe = kwargs.get("ns_pipe", None)
        pbar = kwargs.get("pbar", None)

        out_file = open("{}.txt".format(output_file), "w")

        # Write number of loci
        out_file.write("%s\n" % (len(self.partitions.partitions)))

        if not self.partitions.is_single():

            self._set_pipes(ns_pipe, pbar,
                            total=len(self.partitions.partitions),
                            msg="Writing gphocs output")

            for p, (name, lrange) in enumerate(
                    self.partitions.partitions.items()):

                self._update_pipes(ns_pipe, pbar, value=p + 1)

                lrange = lrange[0]
                out_file.write("%s %s %s\n" % (name, len(self.taxa_list),
                                               lrange[1] - lrange[0]))

                for taxon, seq in self.iter_alignment(
                        table_name=table_name,
                        table_suffix=table_suffix):
                    out_file.write("%s\t%s\n" %
                                   (taxon, seq[lrange[0]:lrange[1]]))

        else:
            self._set_pipes(ns_pipe, pbar, total=len(self.taxa_list),
                            msg="Writing gphocs output")

            out_file.write("%s %s %s\n" % (self.sname,
                                           len(self.taxa_list),
                                           self.locus_length))

            for p, (taxon, seq) in enumerate(self.iter_alignment(
                    table_name=table_name, table_suffix=table_suffix)):
                self._update_pipes(ns_pipe, pbar, value=p + 1)
                out_file.write("%s\t%s\n" % (taxon, seq))

        out_file.close()
    def _write_ima2(self, output_file, **kwargs):
        """Writes `Alignment` object into ima2 file format.

        Parameters
        ----------
        output_file : str
            Name of the output file.
        tx_space_ima2 : int, optional
            Space (in characters) provided for the taxon name in ima2
            format (default is 10).
        cut_space_ima2 : int, optional
            Set the maximum allowed character length for taxon names
            in ima2 format. Longer names are truncated (default is 8).
        ima2_params : list
            A list with the additional information required for the
            ima2 output format. The list should contain the following
            information:

               1. (str) path to file containing the species and
                  populations.
               2. (str) Population tree in newick format, e.g.
                  (0,1):2.
               3. (str) Mutational model for all alignments.
               4. (str) inheritance scalar.

        ns_pipe : multiprocessing.Manager.Namespace, optional
            A Namespace object used to communicate with the main
            thread in TriFusion.
        pbar : ProgressBar, optional
            A ProgressBar object used to log the progress of TriSeq
            execution.
        table_suffix : string, optional
            Suffix of the table from where the sequence data is
            fetched. The suffix is appended to `Alignment.table_name`.
        table_name : string, optional
            Name of the table from where the sequence data is fetched.
        """

        tx_space_ima2 = kwargs.get("tx_space_ima2", 10)
        cut_space_ima2 = kwargs.get("cut_space_ima2", 8)
        ima2_params = kwargs.get("ima2_params", None)
        table_suffix = kwargs.get("table_suffix", None)
        table_name = kwargs.get("table_name", None)
        ns_pipe = kwargs.get("ns_pipe", None)
        pbar = kwargs.get("pbar", None)

        population_file = ima2_params[0]
        population_tree = ima2_params[1]
        mutational_model = ima2_params[2]
        inheritance_scalar = ima2_params[3]

        # Get information on which species belong to each population
        # from the populations file
        population_handle = open(population_file)
        population_storage = OrderedDict()
        for line in population_handle:
            taxon, population = line.strip().split()
            try:
                population_storage[population.strip()].append(taxon)
            except KeyError:
                population_storage[population.strip()] = [taxon]

        # Write the general header of the IMa2 input file
        out_file = open("{}.txt".format(output_file), "w")
        # First line with general description
        out_file.write("Input file for IMa2 using %s alignments\n"
                       "%s\n"  # Line with number of loci
                       "%s\n"  # Line with name of populations
                       "%s\n"  # Line with population string
                       % (len(self.partitions.partitions),
                          len(population_storage),
                          " ".join(population_storage.keys()),
                          population_tree))

        if not self.partitions.is_single():

            self._set_pipes(ns_pipe, pbar,
                            total=len(self.partitions.partitions),
                            msg="Writing IMa2 output")

            # Write each locus
            for p, (partition, lrange) in enumerate(self.partitions):

                self._update_pipes(ns_pipe, pbar, value=p + 1)

                # Retrieving taxon names and sequence data. This step
                # is the first because it will enable the removal of
                # species containing only missing data.
                new_alignment = []

                # This temporary ordered dictionary is created so that
                # the number of taxa per populations is corrected in
                # each locus
                current_locus_populations = OrderedDict(
                    (x, []) for x in population_storage)

                for population, taxa_list in population_storage.items():
                    for taxon in taxa_list:
                        # This try statement catches common errors,
                        # such as providing a species in the mapping
                        # file that does not exist in the alignment
                        try:
                            seq = self.get_sequence(
                                taxon, table_name=table_name,
                                table_suffix=table_suffix)[
                                    lrange[0][0]:lrange[0][1]].upper()
                        except KeyError:
                            print("Taxon %s provided in auxiliary "
                                  "population mapping file is not found "
                                  "in the alignment" % taxon)
                            raise SystemExit

                        if seq.replace(self.sequence_code[1], "") != "":
                            new_alignment.append(
                                (taxon[:cut_space_ima2].ljust(
                                    tx_space_ima2), seq))
                            current_locus_populations[population] \
                                .append(taxon)

                # Write the header of each partition
                out_file.write("%s %s %s %s %s\n" % (
                    partition,
                    " ".join([str(len(x)) for x in
                              list(current_locus_populations.values())]),
                    (lrange[0][1]) - lrange[0][0],
                    mutational_model,
                    inheritance_scalar))

                # Write sequence data according to the order of the
                # population mapping file
                for taxon, seq in new_alignment:
                    out_file.write("%s%s\n" % (taxon, seq))

        if self.partitions.is_single():

            self._set_pipes(ns_pipe, pbar,
                            total=len(population_storage),
                            msg="Writing IMa2 output")

            # Write the header for the single locus
            out_file.write("%s %s %s %s %s\n" % (
                self.sname,
                " ".join([str(len(x)) for x in
                          population_storage.values()]),
                self.locus_length,
                mutational_model,
                inheritance_scalar))

            # Write sequence data
            for p, (population, taxa_list) in enumerate(
                    population_storage.items()):

                self._update_pipes(ns_pipe, pbar, value=p + 1)

                for taxon in taxa_list:
                    seq = self.get_sequence(
                        taxon, table_name=table_name,
                        table_suffix=table_suffix).upper()
                    out_file.write("%s%s\n" % (
                        taxon[:cut_space_ima2].ljust(tx_space_ima2),
                        seq))

        out_file.close()
    def _write_mcmctree(self, output_file, **kwargs):
        """Writes `Alignment` object into mcmctree file format.

        Parameters
        ----------
        output_file : str
            Name of the output file.
        tx_space_phy : int, optional
            Space (in characters) provided for the taxon name in
            phylip format (default is 40).
        cut_space_phy : int, optional
            Set the maximum allowed character length for taxon names
            in phylip format. Longer names are truncated
            (default is 39).
        ns_pipe : multiprocessing.Manager.Namespace, optional
            A Namespace object used to communicate with the main
            thread in TriFusion.
        pbar : ProgressBar, optional
            A ProgressBar object used to log the progress of TriSeq
            execution.
        table_suffix : string, optional
            Suffix of the table from where the sequence data is
            fetched. The suffix is appended to `Alignment.table_name`.
        table_name : string, optional
            Name of the table from where the sequence data is fetched.
        """

        tx_space_phy = kwargs.get("tx_space_phy", 40)
        cut_space_phy = kwargs.get("cut_space_phy", 39)
        table_suffix = kwargs.get("table_suffix", None)
        table_name = kwargs.get("table_name", None)
        ns_pipe = kwargs.get("ns_pipe", None)
        pbar = kwargs.get("pbar", None)

        out_file = open("{}_mcmctree.phy".format(output_file), "w")
        taxa_number = len(self.taxa_list)

        if not self.partitions.is_single():

            self._set_pipes(ns_pipe, pbar,
                            total=len(self.partitions.partitions),
                            msg="Writing mcmctree output")

            for p, lrange in enumerate(
                    self.partitions.partitions.values()):

                self._update_pipes(ns_pipe, pbar, value=p + 1)

                lrange = lrange[0]
                out_file.write("%s %s\n" % (taxa_number,
                                            lrange[1] - lrange[0]))

                for taxon, seq in self.iter_alignment(
                        table_name=table_name,
                        table_suffix=table_suffix):
                    out_file.write("%s %s\n" % (
                        taxon[:cut_space_phy].ljust(tx_space_phy),
                        seq[lrange[0]:lrange[1]].upper()))

        else:
            self._set_pipes(ns_pipe, pbar, total=len(self.taxa_list),
                            msg="Writing mcmctree output")

            out_file.write("%s %s\n" % (taxa_number, self.locus_length))

            for p, (taxon, seq) in enumerate(self.iter_alignment(
                    table_name=table_name, table_suffix=table_suffix)):
                self._update_pipes(ns_pipe, pbar, value=p + 1)
                out_file.write("%s %s\n" % (
                    taxon[:cut_space_phy].ljust(tx_space_phy),
                    seq.upper()))

        out_file.close()
    def write_to_file(self, output_format, output_file, **kwargs):
        """Writes alignment data into an output file.

        Writes the alignment object into a specified output file,
        automatically adding the extension, according to the specified
        output formats.

        This function supports the writing of both converted (no
        partitions) and concatenated (partitioned files). The choice
        between these modes is determined by the Partitions object
        associated with the Alignment object. If it contains multiple
        partitions, it will produce a concatenated alignment and the
        auxiliary partition files where necessary. Otherwise it will
        treat the alignment as a single partition.

        Parameters
        ----------
        output_format : list
            List with the output formats to generate. Options are:
            {"fasta", "phylip", "nexus", "stockholm", "gphocs",
            "ima2"}.
        output_file : str
            Name of the output file. If using `ns_pipe` in TriFusion,
            it will prompt the user if the output file already exists.
        tx_space_nex : int, optional
            Space (in characters) provided for the taxon name in nexus
            format (default is 40).
        tx_space_phy : int, optional
            Space (in characters) provided for the taxon name in
            phylip format (default is 40).
        tx_space_ima2 : int, optional
            Space (in characters) provided for the taxon name in ima2
            format (default is 10).
        cut_space_nex : int, optional
            Set the maximum allowed character length for taxon names
            in nexus format. Longer names are truncated
            (default is 39).
        cut_space_phy : int, optional
            Set the maximum allowed character length for taxon names
            in phylip format. Longer names are truncated
            (default is 39).
        cut_space_ima2 : int, optional
            Set the maximum allowed character length for taxon names
            in ima2 format. Longer names are truncated (default is 8).
        interleave : bool, optional
            Determines whether the output alignment will be in leave
            (False) or interleave (True) format. Not all output
            formats support this option.
        gap : str, optional
            Symbol for alignment gaps (default is '-').
        model_phylip : str, optional
            Substitution model for the auxiliary partition file of
            phylip format, compliant with RAxML.
        outgroup_list : list, optional
            Specify list of taxon names that will be defined as the
            outgroup in the nexus output format. This may be useful
            for analyses with MrBayes or other software that may
            require outgroups.
        ima2_params : list, optional
            A list with the additional information required for the
            ima2 output format. The list should contain the following
            information:

               1. (str) path to file containing the species and
                  populations.
               2. (str) Population tree in newick format, e.g.
                  (0,1):2.
               3. (str) Mutational model for all alignments.
               4. (str) inheritance scalar.

        use_charset : bool, optional
            If True, partitions from the `Partitions` object will be
            written in the nexus output format (default is True).
        partition_file : bool, optional
            If True, the auxiliary partitions file will be written
            (default is True).
        output_dir : str, optional
            If provided, the output file will be written on the
            specified directory.
        phy_truncate_names : bool, optional
            If True, taxon names in phylip format are truncated to a
            maximum of 10 characters (default is False).
        ld_hat : bool, optional
            If True, the fasta output format will include a first line
            compliant with the format of LD Hat and will truncate
            sequence names and sequence length per line accordingly.
        use_nexus_models : bool, optional
            If True, writes the partitions charset block in nexus
            format.
        ns_pipe : multiprocessing.Manager.Namespace, optional
            A Namespace object used to communicate with the main
            thread in TriFusion.
        pbar : ProgressBar, optional
            A ProgressBar object used to log the progress of TriSeq
            execution.
        table_suffix : string, optional
            Suffix of the table from where the sequence data is
            fetched. The suffix is appended to `Alignment.table_name`.
        table_name : string, optional
            Name of the table from where the sequence data is fetched.
        """

        # Get relevant keyword arguments
        ns_pipe = kwargs.get("ns_pipe", None)
        model_phylip = kwargs.get("model_phylip", None)

        write_methods = {
            "fasta": self._write_fasta,
            "phylip": self._write_phylip,
            "nexus": self._write_nexus,
            "stockholm": self._write_stockholm,
            "gphocs": self._write_gphocs,
            "ima2": self._write_ima2,
            "mcmctree": self._write_mcmctree
        }

        # If a specific output directory is provided, the output file
        # will be written there
        output_dir = kwargs.pop("output_dir", None)
        if output_dir:
            output_file = join(output_dir, output_file)
            if not exists(output_dir):
                os.makedirs(output_dir)

        # Stores suffixes for each format
        format_ext = {"ima2": ".txt",
                      "mcmctree": "_mcmctree.phy",
                      "phylip": ".phy",
                      "nexus": ".nex",
                      "fasta": ".fas",
                      "stockholm": ".stockholm",
                      "gphocs": ".txt"}

        # Check if any output file already exists. If so, issue a
        # warning through the app or terminal pipes
        for f in output_format:
            fname = output_file + format_ext[f]
            if exists(fname):
                # File exists; issue warning through appropriate pipe
                if ns_pipe:
                    if not ns_pipe.apply_all:
                        ns_pipe.file_dialog = fname
                        while not ns_pipe.status:
                            pass
                # When the dialog has been closed, check if file is to
                # be skipped or overwritten
                if ns_pipe:
                    if ns_pipe.status == "skip":
                        # Skip file
                        return

        # Reset pipes, if any
        if ns_pipe:
            ns_pipe.status = None

        # This will determine the default model value. GTR for
        # nucleotides and LG for proteins
        if not model_phylip:
            if self.sequence_code[0] == "DNA":
                model_phylip = "GTR"
            else:
                model_phylip = "LG"
            kwargs["model_phylip"] = model_phylip

        # If the alignment's gaps have been coded (restriction_range
        # is set), only the nexus format is supported
        if self.restriction_range is not None:
            if output_format != ["nexus"]:
                return 0

        for fmt in output_format:
            write_methods[fmt](output_file, **kwargs)

        self._reset_pipes(ns_pipe)

        # Remove temporary interleave data, if present
        if self._table_exists(".interleavedata"):
            self.cur.execute("DROP TABLE [.interleavedata]")
class AlignmentList(Base):
    """Main interface for groups of `Alignment` objects.

    The `AlignmentList` class can be seen as a group of `Alignment`
    objects that is treated as a single cohesive data set. It is the
    main class used to perform alignment operations on TriFusion and
    TriSeq programs, even when there is only a single alignment.

    An instance can be created by simply providing a list of paths to
    alignment files, which can be empty (and alignments added later).
    It is recommended that the path to the sqlite database be
    specified via the `sql_db` argument, but this is optional.

    Individual alignment files can be provided when instantiating the
    class or added later, and each is processed and stored as an
    `Alignment` object. The `AlignmentList` object provides attributes
    that are relative to the global data set. For instance, each
    `Alignment` object may have its own unique list of taxa, but the
    corresponding attribute in the `AlignmentList` object contains the
    list of all taxa that are found in the `Alignment` object list.

    Several methods of the `Alignment` class are also present in this
    class with the same usage (e.g. `collapse`). These are basically
    wrappers that apply the method to all `Alignment` objects and may
    have some modifications adapted to multiple alignments.

    Other methods are exclusive and designed to deal with multiple
    alignments, such as concatenation (`concatenate`), filter by taxa
    (`filter_by_taxa`), etc.

    Plotting methods are exclusive to `AlignmentList`, even the ones
    that are focused on single alignments. In this case, the
    `Alignment` object can be specified and processed individually
    directly from this class.

    Parameters
    ----------
    alignment_list : list
        List with paths to alignment files. Can be empty.
    sql_db : str, optional
        Path where the sqlite3 database will be created.
    db_cur : sqlite3.Cursor, optional
        Provide a database Cursor object in conjunction with the
        Connection object (`db_con`) to connect to an existing
        database.
    db_con : sqlite3.Connection, optional
        Provide a database Connection object in conjunction with the
        Cursor object (`db_cur`) to connect to an existing database.
    pbar : ProgressBar, optional
        A ProgressBar object used to log the progress of TriSeq
        execution.

    Attributes
    ----------
    con : sqlite3.Connection
        Connection object of the sqlite database.
    cur : sqlite3.Cursor
        Cursor object of the sqlite database.
    sql_path : str
        Path to sqlite3 database file.
    alignments : collections.OrderedDict
        Stores the 'active' `Alignment` objects as values, with the
        corresponding key being `Alignment.path`.
    shelve_alignments : collections.OrderedDict
        Stores the 'inactive' or 'shelved' `Alignment` objects.
        `AlignmentList` methods will operate only on the `alignments`
        attribute, unless explicitly stated otherwise. The key:value
        is the same as in the `alignments` attribute.
    bad_alignments : list
        Stores the `Alignment.name` attribute of badly formatted
        alignments.
    duplicate_alignments : list
        Stores the `Alignment.name` attribute of duplicated alignment
        paths.
    non_alignments : list
        Stores the `Alignment.name` attribute of sequence sets of
        unequal length.
    taxa_names : list
        Stores the name of all taxa across the `Alignment` object
        list.
    shelved_taxa : list
        Stores the 'inactive' or 'shelved' taxa. `AlignmentList`
        methods will operate only on the `taxa_names` attribute,
        unless explicitly stated otherwise.
    path_list : list
        List of `Alignment.paths` for all `Alignment` objects.
    filtered_alignments : collections.OrderedDict
        Ordered dictionary of four key:values used to count the number
        of `Alignment` objects that were filtered by four operations.
        The keys are {"By minimum taxa", "By taxa", "By variable
        sites", "By informative sites"}.
    sequence_code : tuple
        Contains information on (<sequence type>, <missing data
        symbol>), e.g. ("Protein", "x").
    gap_symbol : str
        Symbol used to denote gaps in the alignment (default is '-').
    summary_stats : dict
        Dictionary that stores several summary statistics that are
        calculated once the `get_summary_stats` method is executed.
    summary_gene_table : pandas.DataFrame
        DataFrame containing the summary statistics for each
        `Alignment` object. Also populated when the
        `get_summary_stats` method is executed.
    active_tables : list
        List of active database tables associated with the
        `AlignmentList` instance.
    partitions : trifusion.process.data.Partitions
        Partitions object that refers to the total `AlignmentList`.
    """

    def __init__(self, alignment_list, sql_db=None, db_cur=None,
                 db_con=None, pbar=None):

        # Create connection and cursor for sqlite database.
        # If `db_cur` and `db_con` are both provided, setup the
        # database connection and ignore `sql_db`
        if db_cur and db_con:
            self.con = db_con
            """
            Database Connection object
            """
            self.cur = db_cur
            """
            Database Cursor object
            """
        # If only `sql_db` is provided, set it as the attribute
        elif sql_db:
            self.sql_path = sql_db
        # If no database information is provided, set a temporary
        # database file in the current working directory
        else:
            self.sql_path = "trifusion.sqlite3"
            """
            Path to sqlite3 database file
            """

        if not db_cur and not db_con:
            self.con = sqlite3.connect(self.sql_path,
                                       check_same_thread=False)
            self.cur = self.con.cursor()
            self.cur.execute("PRAGMA synchronous = OFF")

        self.alignments = OrderedDict()
        """
        Stores the "active" `Alignment` objects for the current
        `AlignmentList`. Keys will be the `Alignment.path` for quick
        lookup of `Alignment` object values
        """

        self.shelve_alignments = OrderedDict()
        """
        Stores the "inactive" or "shelved" `Alignment` objects. All
        `AlignmentList` methods will operate only on the alignments
        attribute, unless explicitly stated otherwise. Key-value is
        the same as the `alignments` attribute.
        """

        self.bad_alignments = []
        """
        Attribute list that stores the Alignment.name attribute of
        badly formatted alignments
        """

        self.duplicate_alignments = []
        """
        Attribute list that stores duplicate Alignment.name.
        """

        self.non_alignments = []
        """
        Attribute list that stores sequence sets of unequal length
        """

        self.taxa_names = []
        """
        List with the name of the taxa included in the AlignmentList
        object
        """

        self.shelved_taxa = []
        """
        List with non active taxa
        """

        self.path_list = []
        """
        List of `Alignment.path`
        """

        self.filtered_alignments = OrderedDict(
            [("By minimum taxa", None),
             ("By taxa", None),
             ("By variable sites", None),
             ("By informative sites", None)])
        """
        Records the number of filtered alignments by each individual
        filter type
        """

        self.sequence_code = None
        """
        Tuple with the AlignmentList sequence code. Either ("DNA", "n")
        or ("Protein", "x")
        """

        self.gap_symbol = "-"

        """
        Dictionary with summary statistics for the active alignments
        """
        self.summary_stats = {"genes": 0, "taxa": 0, "seq_len": 0,
                              "gaps": 0, "avg_gaps": [], "missing": 0,
                              "avg_missing": [], "variable": 0,
                              "avg_var": [], "informative": 0,
                              "avg_inf": []}

        """
        Dictionary with summary statistics for each alignment.
        """
        columns = ["genes", "nsites", "taxa", "var", "inf", "gap",
                   "missing"]
        self.summary_gene_table = pd.DataFrame(columns=columns)

        """
        Lists the currently active tables. This is mainly used for the
        conversion of the consensus alignments into a single Alignment
        object
        """
        self.active_tables = []

        # Set partitions object
        self.partitions = Partitions()

        # if type(alignment_list[0]) is str:
        if alignment_list:
            self.add_alignment_files(alignment_list, pbar=pbar)

    def __iter__(self):
        """Iterator behavior for `AlignmentList`.

        Returns an iterator over the 'active' `Alignment` objects.

        Returns
        -------
        _ : iter
            Iterable over `alignments.values()`
        """
        return iter(self.alignments.values())
    def _create_table(self, table_name, index=None, cur=None):
        """Creates a new table in the database.

        Convenience method that creates a new table in the sqlite
        database. It accepts a custom Cursor object, which overrides
        the `cur` attribute. Optionally, an index can also be
        specified.

        Parameters
        ----------
        table_name : str
            Name of the table to be created.
        index : list, optional
            Provide a list with
            [<name of index>, <columns to be indexed>].
            (e.g., ["concindex", "txId"]).
        cur : sqlite3.Cursor, optional
            Custom Cursor object used to query the database.
        """

        if not cur:
            cur = self.cur

        cur.execute("CREATE TABLE [{}]("
                    "txId INT,"
                    "taxon TEXT,"
                    "seq TEXT)".format(table_name))

        if index:
            cur.execute("CREATE INDEX {} ON [{}]({})".format(
                index[0], table_name, index[1]))
    def _table_exists(self, table_name, cur=None):
        """Checks if a table exists in the database.

        Convenience method that checks if a table exists in the
        database. It accepts a custom Cursor object, which overrides
        the `cur` attribute.

        Parameters
        ----------
        table_name : str
            Name of the table.
        cur : sqlite3.Cursor, optional
            Custom Cursor object used to query the database.

        Returns
        -------
        res : list
            List with the results of a query for 'table' type with
            `table_name` name. Is empty when the table does not exist.

        Notes
        -----
        This returns a list that will contain one item if the table
        exists in the database. If it doesn't exist, returns an empty
        list. Therefore, this can be used simply like::

            if self._table_exists("my_table"):
                # Do stuff
        """

        if not cur:
            cur = self.cur

        return cur.execute(
            "SELECT name FROM sqlite_master WHERE type='table' AND"
            " name='{}'".format(table_name)).fetchall()
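The existence check above queries sqlite's built-in `sqlite_master` catalog. The same idea can be verified against an in-memory database; this sketch uses a parameterized `?` placeholder rather than string formatting, which is the safer variant when table names come from user input (the free function here is illustrative, not the class method):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE [aln](txId INT, taxon TEXT, seq TEXT)")


def table_exists(cur, table_name):
    """Query sqlite_master for a table, as `_table_exists` does.

    Returns a non-empty list when the table exists, an empty list
    otherwise, so the result can be used directly as a boolean.
    """
    return cur.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND"
        " name=?", (table_name,)).fetchall()
```

Truthiness of the returned list is what makes the `if self._table_exists(...)` idiom in the docstring work.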
    def _set_pipes(self, ns=None, pbar=None, total=None, msg=None):
        """Setup of progress indicators for both GUI and CLI task
        executions.

        This handles the setup of the objects responsible for updating
        the progress of task's execution of both TriFusion (GUI) and
        TriSeq (CLI).

        At the beginning of any given task, these objects can be
        initialized by providing either the Namespace object (`ns`)
        in the case of TriFusion, or the ProgressBar object (`pbar`),
        in the case of TriSeq. Along with one of these objects, the
        expected `total` of the progress should also be provided. The
        `ns` and `pbar` objects are updated at each iteration of a
        given task, and the `total` is used to get a measure of the
        progress.

        Optionally, a message can be also provided for the Namespace
        object that will be used by TriFusion.

        Parameters
        ----------
        ns : multiprocessing.Manager.Namespace
            A Namespace object used to communicate with the main
            thread in TriFusion.
        pbar : ProgressBar
            A ProgressBar object used to log the progress of TriSeq
            execution.
        total : int
            Expected total of the task's progress.
        msg : str, optional
            A secondary message that appears in TriFusion's progress
            dialogs.

        Notes
        -----
        The progress of any given task can be provided by either an
        `Alignment` or `AlignmentList` instance. Generally, the tasks
        follow the progress of the `AlignmentList` instance, unless
        that instance contains only one `Alignment` object. In that
        case, the progress information is piped from the `Alignment`
        instance. For that reason, and for the Namespace (`ns`) object
        only, this method checks if there is only one 'active'
        `Alignment` object. If yes, set the `Namespace.sa` flag to
        True, so that the progress indicators are piped from the
        `Alignment` object instead.

        Examples
        --------
        Start a progress counter for a task that will make 100
        iterations::

            self._set_pipes(ns=ns_obj, pbar=pbar_obj, total=100,
                            msg="Some message")
        """

        if ns:
            if ns.stop:
                raise KillByUser("")
            if len(self.alignments) == 1:
                ns.sa = True
            else:
                ns.sa = False
                ns.total = total
                ns.counter = 0

        if pbar:
            pbar.max_value = total
@staticmethod
def _update_pipes(ns=None, pbar=None, value=None, msg=None):
    """Update progress indicators for both GUI and CLI task executions.

    This method provides a single interface for updating the progress
    objects `ns` or `pbar`, which should have been initialized at the
    beginning of the task with the `_set_pipes` method.

    Parameters
    ----------
    ns : multiprocessing.Manager.Namespace
        A Namespace object used to communicate with the main thread
        in TriFusion.
    pbar : ProgressBar
        A ProgressBar object used to log the progress of TriSeq
        execution.
    value : int
        Value of the current progress index.
    msg : str, optional
        A secondary message that appears in TriFusion's progress
        dialogs.

    Notes
    -----
    The progress of any given task can be provided by either an
    `Alignment` or `AlignmentList` instance. Generally, the tasks
    follow the progress of the `AlignmentList` instance, unless that
    instance contains only one `Alignment` object. In that case, the
    progress information is piped from the `Alignment` instance. For
    that reason, and for the Namespace (`ns`) object only, if the
    `Namespace.sa` attribute is set to True, the progress is not
    updated here, since it is being piped directly from the
    `Alignment` object.

    Examples
    --------
    Update the counter in an iteration of 100::

        for i in range(100):
            self._update_pipes(ns=ns_obj, pbar=pbar_obj,
                               value=i, msg="Some string")
    """

    if ns:
        if ns.stop:
            raise KillByUser("")
        if not ns.sa:
            ns.msg = msg
            ns.counter = value

    if pbar:
        pbar.update(value)
@staticmethod
def _reset_pipes(ns):
    """Reset progress indicators for both GUI and CLI task executions.

    This should be done at the end of any task that initialized the
    progress objects, but it only affects the Namespace object. It
    resets all Namespace attributes to None.

    Parameters
    ----------
    ns : multiprocessing.Manager.Namespace
        A Namespace object used to communicate with the main thread
        in TriFusion.
    """

    if ns:
        if ns.stop:
            raise KillByUser("")
        ns.counter = ns.total = ns.msg = ns.sa = None
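The set/update/reset life cycle implemented by `_set_pipes`, `_update_pipes` and `_reset_pipes` can be mimicked with a plain namespace object. This sketch uses `types.SimpleNamespace` as a stand-in for the multiprocessing Manager Namespace (the real `ns` is a proxy object; the attribute protocol is the same):

```python
from types import SimpleNamespace

# Stand-in for the multiprocessing Manager Namespace used by TriFusion.
ns = SimpleNamespace(stop=False, sa=False, total=None, counter=None,
                     msg=None)

# _set_pipes equivalent: initialize the expected total and the counter.
ns.total = 100
ns.counter = 0

# _update_pipes equivalent: advance the counter at each iteration,
# unless progress is being piped from a single Alignment (ns.sa).
for i in range(ns.total):
    if not ns.sa:
        ns.msg = "Processing item {}".format(i)
        ns.counter = i + 1

progress = ns.counter / float(ns.total)  # 1.0 when the task is done

# _reset_pipes equivalent: clear all progress attributes.
ns.counter = ns.total = ns.msg = ns.sa = None
```

The main thread polls `ns.counter` against `ns.total` to draw the progress dialog, which is why the attributes must be reset between tasks.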
def aln_names(self):
    """Returns a sorted list of basenames of `alignments` file paths."""

    return sorted([basename(x) for x in self.alignments])
def resume_database(self):
    """Reconnects to the sqlite database.

    The connection to the database can be closed or even severed when
    passing objects between threads. This method allows the
    reconnection of the database for the `AlignmentList` and *all*
    (even the shelved ones) `Alignment` objects.
    """

    self.con = sqlite3.connect(self.sql_path, check_same_thread=False)
    self.cur = self.con.cursor()

    for aln in list(self.alignments.values()) + \
            list(self.shelve_alignments.values()):
        aln.cur = self.cur
        aln.con = self.con
[docs] def set_database_connections(self, cur, con): """Provides Connection and Cursor to `Alignment` objects. Sets the database connections manually for *all* (even the shelved ones) `Alignment` objects. Parameters ---------- cur : sqlite3.Cursor Provide a database Cursor object. con : sqlite3.Connection Provide a database Connection object. """ self.cur = cur self.con = con for aln in self.alignments.values() + self.shelve_alignments.values(): aln.cur = cur aln.con = con
[docs] def get_tables(self): """Return list with `table_name` of *all* `Alignment` objects. Returns ------- _ : list List with `Alignment.table_name`. """ return [x.table_name for x in self.alignments.values() + self.shelve_alignments.values()]
[docs] def remove_tables(self, preserve_tables=None, trash_tables=None): """Drops tables from the database. This method can be used to drop tables from the database. If no arguments are provided it will remove *all* database tables associated with the `AlignmentList` object. If the `preserve_tables` argument is provided, the table names present in this list will not be dropped. If `trash_tables` is provided, only the tables names specified in that list are dropped. If both are provided, `trash_tables` will take precedence. Parameters ---------- preserve_tables : list, optional If provided, the table names in this list will NOT be dropped. trash_tables : list, optional If provided, only the table names in this list will be dropped. Takes precedence over `preserve_tables` if both are provided. """ if not preserve_tables: preserve_tables = [] if not trash_tables: trash_tables = [] tables = self.cur.execute( "SELECT name FROM sqlite_master WHERE type='table';").fetchall() if trash_tables: tables_delete = [x[0] for x in tables if x[0] in trash_tables] else: tables_delete = [x[0] for x in tables if x[0] not in preserve_tables] for tb in tables_delete: self.cur.execute("DROP TABLE [{}]".format(tb))
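The preserve/trash logic of `remove_tables` boils down to a filtered `DROP TABLE` loop over `sqlite_master`. A standalone sketch of that pattern, with illustrative table names rather than TriFusion's:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
for name in ("main", "filter", "consensus"):
    cur.execute("CREATE TABLE [{}] (txId INT)".format(name))

preserve_tables = ["main"]

tables = cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table';").fetchall()

# Drop every table whose name is not in the preserve list.
for tb in [x[0] for x in tables if x[0] not in preserve_tables]:
    cur.execute("DROP TABLE [{}]".format(tb))

remaining = [x[0] for x in cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table';").fetchall()]
print(remaining)  # ['main']
```

When a `trash_tables` list is given instead, the comprehension simply flips to `x[0] in trash_tables`, which is why it takes precedence over `preserve_tables`.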
[docs] def clear_alignments(self): """Clears all attributes and data from the `AlignmentList` object.""" # Drop all database tables related to the current Alignments for aln in self.alignments.values() + self.shelve_alignments.values(): aln.remove_alignment() # Drop active databases of the AlignmentList instance, namely consensus # and concatenation, if they exist for table in self.active_tables: self.cur.execute("DROP TABLE [{}]".format(table)) self.alignments = {} self.shelve_alignments = {} self.bad_alignments = [] self.duplicate_alignments = [] self.non_alignments = [] self.taxa_names = [] self.shelved_taxa = [] self.path_list = [] self.filtered_alignments = OrderedDict([("By minimum taxa", None), ("By taxa", None), ("By variable sites", None), ("By informative sites", None)]) self.sequence_code = None self.summary_stats = {"genes": 0, "taxa": 0, "seq_len": 0, "gaps": 0, "avg_gaps": [], "missing": 0, "avg_missing": [], "variable": 0, "avg_var": [], "informative": 0, "avg_inf": []} columns = ["genes", "nsites", "taxa", "var", "inf", "gap", "missing"] self.summary_gene_table = pd.DataFrame(columns=columns) self.active_tables = [] self.partitions = Partitions()
[docs] def _reset_summary_stats(self): """Resets the `summary_stats` attribute.""" self.summary_stats = {"genes": 0, "taxa": 0, "seq_len": 0, "gaps": 0, "avg_gaps": [], "missing": 0, "avg_missing": [], "variable": 0, "avg_var": [], "informative": 0, "avg_inf": []}
def update_active_alignments(self, aln_list=None, all_files=False):
    """Sets the active alignments.

    Sets the 'active' alignments stored in the `alignments` attribute
    and the 'inactive' alignments stored in the `shelve_alignments`
    attribute, based on the `aln_list` argument. Regardless of the
    current 'active' and 'inactive' alignments, calling this method
    with the `aln_list` argument will set only those as the 'active'
    alignments. Alternatively, the `all_files` bool argument can be
    provided to set all alignments as 'active'. After updating the
    `alignments` and `shelve_alignments` attributes, the `taxa_names`
    attribute is also updated.

    Parameters
    ----------
    aln_list : list
        List of `Alignment.name` that will be set as 'active'.
    all_files : bool
        If True, all alignments become active. Ignores `aln_list`.
    """

    if all_files:
        for aln in self.shelve_alignments:
            self.alignments[aln] = self.shelve_alignments[aln]

    elif aln_list is not None:
        for aln in list(self.alignments.keys()) + \
                list(self.shelve_alignments.keys()):

            if aln not in aln_list:
                try:
                    self.shelve_alignments[aln] = self.alignments[aln]
                    del self.alignments[aln]
                except KeyError:
                    pass
            else:
                try:
                    self.alignments[aln] = self.shelve_alignments[aln]
                    del self.shelve_alignments[aln]
                except KeyError:
                    pass

    # Update taxa names
    self.taxa_names = self._get_taxa_list()
def update_active_alignment(self, aln_name, direction):
    """Updates the 'active' status of a single alignment.

    Changes the 'active' status of a single alignment in a specified
    direction. After updating the `alignments` and
    `shelve_alignments` attributes, the `taxa_names` attribute is
    also updated.

    Parameters
    ----------
    aln_name : str
        `Alignment.name` to update.
    direction : str
        Direction where `aln_name` will be moved. Can be
        {"shelve", "active"}.
    """

    if direction == "shelve":
        self.shelve_alignments[aln_name] = self.alignments[aln_name]
        del self.alignments[aln_name]
    else:
        self.alignments[aln_name] = self.shelve_alignments[aln_name]
        del self.shelve_alignments[aln_name]

    # Update taxa names
    self.taxa_names = self._get_taxa_list()
def update_taxa_names(self, taxa_list=None, all_taxa=False):
    """Sets the active taxa.

    Sets the 'active' taxa stored in the `taxa_names` attribute and
    the 'inactive' taxa stored in the `shelved_taxa` attribute, based
    on the `taxa_list` argument. Alternatively, the `all_taxa` bool
    argument can be provided to set all taxa as 'active'.

    Parameters
    ----------
    taxa_list : list
        List of taxon names that will be set as 'active'.
    all_taxa : bool
        If True, all taxa become active. Ignores `taxa_list`.
    """

    # Activate all taxa
    if all_taxa:
        for tx in self.shelved_taxa:
            self.taxa_names.append(tx)
        self.shelved_taxa = []

    # Activate only taxa specified by taxa_list
    elif taxa_list:
        for tx in self.taxa_names + self.shelved_taxa:
            if tx not in taxa_list and tx in self.taxa_names:
                try:
                    self.taxa_names.remove(tx)
                    self.shelved_taxa.append(tx)
                except ValueError:
                    pass
            elif tx in taxa_list and tx in self.shelved_taxa:
                try:
                    self.taxa_names.append(tx)
                    self.shelved_taxa.remove(tx)
                except ValueError:
                    pass

    # Update individual Alignment objects
    for aln_obj in self.alignments.values():
        aln_obj.shelve_taxa(self.shelved_taxa)
[docs] def format_list(self): """Returns list of unique sequence types from `Alignment` objects. Returns ------- _ : list List with unique sequence types for `Alignment` objects as strings. """ return list(set([x.sequence_code[0] for x in self.alignments.values() if x]))
[docs] def _get_taxa_list(self, only_active=False): """Gets the global taxa list from all alignments. If called with no arguments, it returns the taxa list from *all* alignments, including the 'inactive' ones. The `only_active` argument can be set to True to return only the taxa list from the 'active' files. Parameters ---------- only_active : bool If True, only returns the taxa list from the 'active' alignments. Returns ------- full_taxa : list List with taxon names as strings. """ if only_active: full_taxa = list(set().union(*[x.taxa_list for x in self.alignments.values()])) else: full_taxa = list(set().union(*[x.taxa_list for x in self.alignments.values() + self.shelve_alignments.values()])) return full_taxa
[docs] def _get_filename_list(self): """Returns list with the `Alignment.name` of alignments. Returns ------- _ : list List of `Alignment.name` of alignments. """ return [alignment.name for alignment in self.alignments.values()]
def set_partition_from_alignment(self, alignment_obj):
    """Updates `partitions` object with the provided `Alignment` object.

    Uses the `partitions` of an `Alignment` object to set the
    `partitions` attribute of `AlignmentList`.

    Parameters
    ----------
    alignment_obj : trifusion.process.sequence.Alignment
        `Alignment` object.
    """

    # Update partitions object. The use_counter argument is set to
    # True when adding each partition below.
    if not alignment_obj.partitions.is_single():
        for k, v in alignment_obj.partitions:
            self.partitions.add_partition(
                k, locus_range=v[0], codon=v[1],
                use_counter=True, file_name=alignment_obj.path,
                model_cls=alignment_obj.partitions.models[k])
    else:
        self.partitions.add_partition(
            alignment_obj.name, use_counter=True,
            file_name=alignment_obj.path,
            length=alignment_obj.locus_length,
            model_cls=alignment_obj.partitions.models[
                alignment_obj.name])
def add_alignments(self, alignment_obj_list, ignore_paths=False):
    """Add a list of `Alignment` objects to the current `AlignmentList`.

    Adds a list of alignments, already as `Alignment` objects, to the
    current `AlignmentList`. This method performs several checks to
    ensure that the `Alignment` objects being added are correct. If
    all checks pass, all attributes of the class are updated with the
    new alignments. The `ignore_paths` argument can be used to
    prevent `Alignment` objects with short paths (only the basename,
    for instance) from being added when they have the same basename
    as another alignment in the `alignments` dict.

    Parameters
    ----------
    alignment_obj_list : list
        List with `Alignment` objects.
    ignore_paths : bool
        If True, the `Alignment.path` of the new alignments is not
        checked against the loaded alignments.
    """

    for alignment_obj in alignment_obj_list:

        # Check for badly formatted alignments
        if isinstance(alignment_obj.e, InputError):
            self.bad_alignments.append(alignment_obj.path)
        elif isinstance(alignment_obj.e, AlignmentUnequalLength):
            self.non_alignments.append(alignment_obj.path)

        if not ignore_paths:
            if alignment_obj.path in [x.path for x in
                                      self.alignments.values()]:
                self.duplicate_alignments.append(alignment_obj.name)
            else:
                # Get seq code
                if not self.sequence_code:
                    self.sequence_code = alignment_obj.sequence_code

                self.alignments[alignment_obj.path] = alignment_obj
                self.set_partition_from_alignment(alignment_obj)
        else:
            # Get seq code
            if not self.sequence_code:
                self.sequence_code = alignment_obj.sequence_code

            self.alignments[alignment_obj.name] = alignment_obj
            self.set_partition_from_alignment(alignment_obj)

    self.taxa_names = self._get_taxa_list()
def add_alignment_files(self, file_name_list, pbar=None, ns=None):
    """Adds a list of alignment files to the current `AlignmentList`.

    Adds a list of alignment paths to the current `AlignmentList`.
    Each path in the list is used to instantiate an `Alignment`
    object, and several checks are performed to ensure that the
    alignments are correct and compliant with the other members of
    the `AlignmentList` object.

    Parameters
    ----------
    file_name_list : list
        List with paths to sequence alignment files.
    ns : multiprocessing.Manager.Namespace
        A Namespace object used to communicate with the main thread
        in TriFusion.
    pbar : ProgressBar
        A ProgressBar object used to log the progress of TriSeq
        execution.
    """

    # Check for duplicates among current file list
    for f in [x for x, y in Counter(file_name_list).items() if y > 1]:
        self.duplicate_alignments.append(f)
        file_name_list.remove(f)

    # Check for duplicates between current file list and previous
    # file list
    for i in set(self.path_list).intersection(set(file_name_list)):
        self.duplicate_alignments.append(i)
        file_name_list.remove(i)

    if ns:
        ns.progress = 0

    if pbar:
        pbar.max_value = len(file_name_list)

    for p, aln_path in enumerate(file_name_list):

        # Progress bar update for command line version
        if pbar:
            pbar.update(p + 1)

        if ns:
            ns.progress += 1
            ns.m = "Processing file {}".format(basename(aln_path))

            if ns.stop:
                raise KillByUser("Child thread killed by user")

        aln_obj = Alignment(aln_path, sql_cursor=self.cur,
                            sql_con=self.con)

        if isinstance(aln_obj.e, InputError):
            self.bad_alignments.append(aln_obj.path)
        elif isinstance(aln_obj.e, AlignmentUnequalLength):
            self.non_alignments.append(aln_obj.path)
        elif isinstance(aln_obj.e, EmptyAlignment):
            self.bad_alignments.append(aln_obj.path)
        else:
            # Get seq code
            if not self.sequence_code:
                self.sequence_code = aln_obj.sequence_code
            # Check for multiple sequence types. If True, raise
            # Exception
            elif self.sequence_code[0] != aln_obj.sequence_code[0]:
                raise MultipleSequenceTypes(
                    "Multiple sequence types detected: {} and "
                    "{}".format(self.sequence_code[0],
                                aln_obj.sequence_code[0]))

            self.alignments[aln_obj.path] = aln_obj
            self.set_partition_from_alignment(aln_obj)
            self.path_list.append(aln_obj.path)

    if self.alignments:
        self.taxa_names = self._get_taxa_list()
[docs] def retrieve_alignment(self, name): """Return `Alignment` object with a given `name`. Retrieves the `Alignment` object with a matching `name` attribute from `alignments` or `shelve_alignments`. Returns None if no match is found. Parameters ---------- name : str `name` attribute of an `Alignment` object. Returns ------- _ : trifusion.process.sequence.Alignment `Alignment` object. """ if name in self.alignments: return self.alignments[name] elif name in self.shelve_alignments: return self.shelve_alignments[name] else: return None
[docs] def iter_alignment_files(self): """Returns an iterable on `Alignment.path` of each alignment. Returns ------- _ : iter Iterable over the `Alignment.path` attribute of each alignment. """ return iter(alignment.path for alignment in self.alignments.values())
def write_taxa_to_file(self, file_name="Taxa_list.csv"):
    """Writes the taxa in `taxa_names` to a file.

    Parameters
    ----------
    file_name : str
        Path to the generated file (default is 'Taxa_list.csv').
    """

    output_handle = open(file_name, "w")

    for taxon in self.taxa_names:
        output_handle.write(taxon + "\n")

    output_handle.close()
def concatenate(self, table_in="", ns=None, pbar=None):
    """Concatenates alignments into a single `Alignment` object.

    Concatenates multiple sequence alignments, creating a single
    `Alignment` object and the auxiliary Partitions object defining
    the partitions of the concatenated alignment.

    Parameters
    ----------
    table_in : string
        Name of database table containing the alignment data that is
        used for this operation.
    ns : multiprocessing.Manager.Namespace
        A Namespace object used to communicate with the main thread
        in TriFusion.
    pbar : ProgressBar
        A ProgressBar object used to log the progress of TriSeq
        execution.

    Returns
    -------
    concatenated_alignment : trifusion.process.sequence.Alignment
        The `Alignment` object with the concatenated data and
        partitions.

    Notes
    -----
    Since sqlite queries are quite expensive, the previous approach
    of querying the sequence for each taxon AND alignment table was
    dropped. In this approach, sqlite will do all the heavy lifting.
    First, a temporary table with the same definition as all
    alignment tables is populated with data from all taxa and
    alignments. It is important that an index is created on the new
    txId column, which will redefine the txId of individual
    alignments according to the global `taxa_names` attribute. In
    this first procedure, there will be only one query per alignment.
    When the temporary table is complete, some sql operations are
    used to group sequences from each taxon for all alignments and
    then concatenate them, all in a single query. This query returns
    the concatenated sequences, corrected txId and taxon information,
    ready to populate the final concatenation table. This approach is
    an order of magnitude faster than the previous one, where
    thousands of queries could be performed.
    """

    self._set_pipes(ns, pbar, total=len(self.alignments))

    # Create the temporary table and create an index
    temp_table = "concatenation_temp"
    self._create_table(temp_table, index=["concidex", "txId"])

    # Variables that will store the taxa_list and taxa_idx that will
    # be provided when instantiating an Alignment object
    taxa_list = self.taxa_names
    taxa_idx = dict((tx, idx) for idx, tx in enumerate(self.taxa_names))

    # Create new cursor to insert data into database while the main
    # cursor returns the query
    conc_cur = self.con.cursor()

    for p, aln in enumerate(self.alignments.values()):

        # Update progress information
        self._update_pipes(ns, pbar, value=p + 1,
                           msg="Processing alignment {}".format(
                               aln.name))

        # For each taxon present in the current Alignment object,
        # update the txId in the new table using the taxa_idx local
        # variable.
        for tx, seq in aln.iter_alignment(table_suffix=table_in):
            conc_cur.execute(
                "INSERT INTO [{}] VALUES (?, ?, ?)".format(temp_table),
                (taxa_idx[tx], tx, seq))

        # Get all the taxa missing from the current Alignment object
        missing_tx = set(self.taxa_names) - set(aln.taxa_list)

        # Insert missing data for the missing taxa in the temporary
        # table
        for tx in missing_tx:
            conc_cur.execute(
                "INSERT INTO [{}] VALUES (?, ?, ?)".format(temp_table),
                (taxa_idx[tx], tx,
                 aln.sequence_code[1] * aln.locus_length))

    # Set the name of the final concatenated table
    table = "concatenation"
    # Create the table
    self._create_table(table)
    self.active_tables.append(table)

    # Reset progress information for next loop
    self._reset_pipes(ns)
    self._set_pipes(ns, pbar, total=len(self.taxa_names))

    # Here is where sqlite will do the heavy lifting of grouping
    # sequences from the same taxon across all alignments and then
    # concatenating them. It is important that the column that will
    # be grouped has an index (for performance and memory reasons).
    # While seqs are grouped by txId, the GROUP_CONCAT() method is
    # used to return concatenated strings.
    for p, (idx, tx, seq) in enumerate(self.cur.execute(
            "SELECT txId, taxon, GROUP_CONCAT(seq, '') FROM [{}] "
            "GROUP BY txId".format(temp_table))):

        self._update_pipes(ns, pbar, value=p + 1,
                           msg="Concatenating taxon {}".format(tx))

        conc_cur.execute(
            "INSERT INTO [{}] VALUES (?, ?, ?)".format(table),
            (idx, tx, seq))

        # Variable that will store the length of the concatenated
        # alignment and provide it when initializing the Alignment
        # object
        locus_length = len(seq)

    # DROP the temporary table
    self.cur.execute("DROP TABLE [{}]".format(temp_table))

    # Remove partitions that are currently in the shelve
    for aln_obj in self.shelve_alignments.values():
        try:
            self.partitions.remove_partition(file_name=aln_obj.path)
        except PartitionException:
            pass

    # Create the concatenated file in an Alignment object
    concatenated_alignment = Alignment(
        table,
        partitions=self.partitions,
        locus_length=locus_length,
        sql_cursor=self.cur,
        sql_con=self.con,
        sequence_code=self.sequence_code,
        taxa_list=taxa_list,
        taxa_idx=taxa_idx)

    return concatenated_alignment
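The single-query concatenation described in the Notes above can be demonstrated in isolation. This sketch uses a toy schema (`txId`, `taxon`, `seq`) matching the description, with made-up taxa and two tiny "alignments"; note that SQLite does not guarantee the order in which `GROUP_CONCAT` joins rows, which is why the sketch checks content rather than a fixed string:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Temporary table mirroring the per-alignment tables.
cur.execute("CREATE TABLE conc_temp (txId INT, taxon TEXT, seq TEXT)")
# Index on the grouped column, as the comment above recommends.
cur.execute("CREATE INDEX conc_idx ON conc_temp(txId)")

# Two alignments of 4 sites each, for the same two taxa.
rows = [(0, "spA", "ACGT"), (1, "spB", "AC--"),   # alignment 1
        (0, "spA", "TTGG"), (1, "spB", "TTGA")]   # alignment 2
cur.executemany("INSERT INTO conc_temp VALUES (?, ?, ?)", rows)

# One query groups each taxon's rows and concatenates the sequences.
result = dict(
    (tx, seq) for _idx, tx, seq in cur.execute(
        "SELECT txId, taxon, GROUP_CONCAT(seq, '') FROM conc_temp "
        "GROUP BY txId"))

print(len(result["spA"]))  # 8: two 4-site alignments concatenated
```

All per-taxon string joining happens inside sqlite, so Python only sees one row per taxon regardless of how many alignments were loaded.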
def filter_min_taxa(self, min_taxa, ns=None, pbar=None):
    """Filters `alignments` by minimum taxa proportion.

    Filters `Alignment` objects based on a minimum taxa
    representation threshold. Alignments with less than the specified
    minimum taxa percentage will be shelved and counted in the
    `filtered_alignments` attribute.

    NOTE: Since this filtering is meant to be performed when
    executing the process operations, it will permanently change the
    AlignmentList object, which means both self.alignments and
    self.partitions. Not doing so, and removing/adding the partitions
    instead, would create a great deal of conflicts that can be
    easily avoided by simply copying the AlignmentList object and
    modifying the copy for the process execution.

    Parameters
    ----------
    min_taxa : int
        Percentage of minimum allowed taxa representation (e.g. 50).
    ns : multiprocessing.Manager.Namespace
        A Namespace object used to communicate with the main thread
        in TriFusion.
    pbar : ProgressBar
        A ProgressBar object used to log the progress of TriSeq
        execution.
    """

    self._set_pipes(ns, pbar, total=len(self.alignments))

    self.filtered_alignments["By minimum taxa"] = 0

    for p, (k, alignment_obj) in enumerate(
            list(self.alignments.items())):

        self._update_pipes(ns, pbar, value=p + 1,
                           msg="Evaluating file {}".format(
                               basename(alignment_obj.name)))

        if len(alignment_obj.taxa_list) < \
                (float(min_taxa) / 100.) * len(self.taxa_names):
            self.update_active_alignment(k, "shelve")
            self.partitions.remove_partition(
                file_name=alignment_obj.path)
            self.filtered_alignments["By minimum taxa"] += 1

    self._reset_pipes(ns)
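The threshold test in `filter_min_taxa` is plain percentage arithmetic; this sketch isolates it (the function name is illustrative, not part of TriFusion's API):

```python
def passes_min_taxa(n_taxa_in_alignment, n_total_taxa, min_taxa):
    """Return True if the alignment meets the minimum taxa percentage."""
    return n_taxa_in_alignment >= (float(min_taxa) / 100.) * n_total_taxa

# With min_taxa=50 and 10 global taxa, the cutoff is 5 taxa.
print(passes_min_taxa(4, 10, 50))  # False: 4 of 10 taxa is below 50%
print(passes_min_taxa(5, 10, 50))  # True: exactly at the threshold
```

Note that the comparison in the source is strict (`<`), so an alignment sitting exactly on the threshold is kept.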
def filter_by_taxa(self, taxa_list, filter_mode, ns=None, pbar=None):
    """Filters `alignments` if they contain or exclude certain taxa.

    Filters the `alignments` attribute by a given taxa list. The
    filtering may be performed to keep alignments depending on
    whether they include or exclude certain taxa.

    Parameters
    ----------
    taxa_list : list
        List of taxa names.
    filter_mode : str
        Filter mode. Can be "contain" or "exclude".
    ns : multiprocessing.Manager.Namespace
        A Namespace object used to communicate with the main thread
        in TriFusion.
    pbar : ProgressBar
        A ProgressBar object used to log the progress of TriSeq
        execution.
    """

    self._set_pipes(ns, pbar, total=len(self.alignments))

    self.filtered_alignments["By taxa"] = 0

    # Support automatic file detection if taxa_list is a string
    if isinstance(taxa_list, str):
        try:
            file_handle = open(taxa_list)
            taxa_list = self.read_basic_csv(file_handle)
        except IOError:
            pass
    elif len(taxa_list) == 1:
        try:
            file_handle = open(taxa_list[0])
            taxa_list = self.read_basic_csv(file_handle)
        except IOError:
            pass

    for p, (k, alignment_obj) in enumerate(
            list(self.alignments.items())):

        self._update_pipes(ns, pbar, value=p + 1,
                           msg="Filtering file {}".format(
                               basename(alignment_obj.name)))

        # Filter alignments that do not contain at least all taxa in
        # taxa_list
        if filter_mode.lower() == "contain":
            if set(taxa_list) - set(alignment_obj.taxa_list) != set():
                self.update_active_alignment(k, "shelve")
                self.partitions.remove_partition(
                    file_name=alignment_obj.path)
                self.filtered_alignments["By taxa"] += 1

        # Filter alignments that contain the taxa in taxa list
        if filter_mode.lower() == "exclude":
            if any((x for x in taxa_list
                    if x in alignment_obj.taxa_list)):
                self.update_active_alignment(k, "shelve")
                self.partitions.remove_partition(
                    file_name=alignment_obj.path)
                self.filtered_alignments["By taxa"] += 1

    self.taxa_names = self._get_taxa_list(only_active=True)

    # If the resulting alignment list is empty, raise an Exception
    if self.alignments == {}:
        if ns:
            ns.exception = "EmptyAlignment"
        raise EmptyAlignment("Alignment is empty after taxa filter")

    self._reset_pipes(ns)
def filter_codon_positions(self, *args, **kwargs):
    """Filters codon positions in each `Alignment` object.

    This wraps the execution of the `filter_codon_positions` method
    for each `Alignment` object in the `alignments` attribute.

    Parameters
    ----------
    position_list : list
        List of three bool elements that correspond to each codon
        position. Ex. [True, True, True] will save all positions,
        while [True, True, False] will exclude the third codon
        position.
    table_in : string
        Name of database table containing the alignment data that is
        used for this operation.
    table_out : string
        Name of database table where the final alignment will be
        inserted (default is "filter").
    use_main_table : bool
        If True, both `table_in` and `table_out` are ignored and the
        main table `Alignment.table_name` is used as the input and
        output table (default is False).
    ns : multiprocessing.Manager.Namespace
        A Namespace object used to communicate with the main thread
        in TriFusion.
    pbar : ProgressBar
        A ProgressBar object used to log the progress of TriSeq
        execution.

    See Also
    --------
    Alignment.filter_codon_positions
    """

    ns = kwargs.get("ns", None)
    pbar = kwargs.pop("pbar", None)

    self._set_pipes(ns, pbar, total=len(self.alignments))

    # Reset partitions
    self.partitions = Partitions()

    for p, alignment_obj in enumerate(list(self.alignments.values())):

        self._update_pipes(ns, pbar, value=p + 1,
                           msg="Filtering file {}".format(
                               basename(alignment_obj.name)))

        alignment_obj.filter_codon_positions(*args, **kwargs)

        self.set_partition_from_alignment(alignment_obj)

    self._reset_pipes(ns)
def filter_missing_data(self, *args, **kwargs):
    """Filters missing data in each `Alignment` object.

    This wraps the execution of the `filter_missing_data` method for
    each `Alignment` object in the `alignments` attribute.

    Parameters
    ----------
    gap_threshold : int
        Integer between 0 and 100 defining the percentage above which
        a column with that gap percentage is removed.
    missing_threshold : int
        Integer between 0 and 100 defining the percentage above which
        a column with that gap+missing percentage is removed.
    table_in : string
        Name of database table containing the alignment data that is
        used for this operation.
    table_out : string
        Name of database table where the final alignment will be
        inserted (default is "filter").
    use_main_table : bool
        If True, both `table_in` and `table_out` are ignored and the
        main table `Alignment.table_name` is used as the input and
        output table (default is False).
    ns : multiprocessing.Manager.Namespace
        A Namespace object used to communicate with the main thread
        in TriFusion.
    pbar : ProgressBar
        A ProgressBar object used to log the progress of TriSeq
        execution.

    See Also
    --------
    Alignment.filter_missing_data
    """

    ns = kwargs.get("ns", None)
    pbar = kwargs.pop("pbar", None)

    # Reset partitions
    self.partitions = Partitions()

    self._set_pipes(ns, pbar, total=len(self.alignments))

    for p, alignment_obj in enumerate(list(self.alignments.values())):

        self._update_pipes(ns, pbar, value=p + 1,
                           msg="Filtering file {}".format(
                               basename(alignment_obj.name)))

        alignment_obj.filter_missing_data(*args, **kwargs)

        self.set_partition_from_alignment(alignment_obj)

    self._reset_pipes(ns)
def filter_segregating_sites(self, *args, **kwargs):
    """Filters `Alignment` objects according to segregating sites number.

    Filters `alignments` according to whether or not their number of
    segregating sites is within a specified range. `Alignment`
    objects with a number of segregating sites outside the range are
    moved to `shelve_alignments` and the counter of
    `filtered_alignments` is updated.

    Parameters
    ----------
    min_val : int
        Minimum number of segregating sites for the alignment to
        pass. Can be None, in which case there is no lower bound.
    max_val : int
        Maximum number of segregating sites for the alignment to
        pass. Can be None, in which case there is no upper bound.
    table_in : string
        Name of database table containing the alignment data that is
        used for this operation.
    ns : multiprocessing.Manager.Namespace
        A Namespace object used to communicate with the main thread
        in TriFusion.
    pbar : ProgressBar
        A ProgressBar object used to log the progress of TriSeq
        execution.

    See Also
    --------
    Alignment.filter_segregating_sites
    """

    ns = kwargs.get("ns", None)
    pbar = kwargs.pop("pbar", None)

    self._set_pipes(ns, pbar, total=len(self.alignments))

    self.filtered_alignments["By variable sites"] = 0

    for p, (k, alignment_obj) in enumerate(
            list(self.alignments.items())):

        self._update_pipes(ns, pbar, value=p + 1,
                           msg="Filtering file {}".format(
                               basename(alignment_obj.name)))

        if not alignment_obj.filter_segregating_sites(*args, **kwargs):
            self.update_active_alignment(k, "shelve")
            self.partitions.remove_partition(
                file_name=alignment_obj.path)
            self.filtered_alignments["By variable sites"] += 1

    self._reset_pipes(ns)
def filter_informative_sites(self, *args, **kwargs):
    """Filters `Alignment` objects according to informative sites number.

    Filters `alignments` according to whether or not their number of
    informative sites is within a specified range. `Alignment`
    objects with a number of informative sites outside the range are
    moved to `shelve_alignments` and the counter of
    `filtered_alignments` is updated.

    Parameters
    ----------
    min_val : int
        Minimum number of informative sites for the alignment to
        pass. Can be None, in which case there is no lower bound.
    max_val : int
        Maximum number of informative sites for the alignment to
        pass. Can be None, in which case there is no upper bound.
    table_in : string
        Name of database table containing the alignment data that is
        used for this operation.
    ns : multiprocessing.Manager.Namespace
        A Namespace object used to communicate with the main thread
        in TriFusion.
    pbar : ProgressBar
        A ProgressBar object used to log the progress of TriSeq
        execution.

    See Also
    --------
    Alignment.filter_informative_sites
    """

    ns = kwargs.get("ns", None)
    pbar = kwargs.pop("pbar", None)

    self._set_pipes(ns, pbar, total=len(self.alignments))

    self.filtered_alignments["By informative sites"] = 0

    for p, (k, alignment_obj) in enumerate(
            list(self.alignments.items())):

        self._update_pipes(ns, pbar, value=p + 1,
                           msg="Filtering file {}".format(
                               basename(alignment_obj.name)))

        if not alignment_obj.filter_informative_sites(*args, **kwargs):
            self.update_active_alignment(k, "shelve")
            self.partitions.remove_partition(
                file_name=alignment_obj.path)
            self.filtered_alignments["By informative sites"] += 1

    self._reset_pipes(ns)
def remove_taxa(self, taxa_list, mode="remove"):
    """Removes the specified taxa.

    Removes the specified taxa from the global `AlignmentList`
    attributes and from each `Alignment` object. This method supports
    a list or a path to a CSV file as the `taxa_list` argument. In
    the case of a CSV path, it parses the CSV file into a list. The
    CSV file should be a simple text file containing one taxon name
    per line.

    This method also supports two removal modes.

        - remove: removes the specified taxa
        - inverse: removes all but the specified taxa

    Parameters
    ----------
    taxa_list : list or string
        List containing taxa names or string with path to a CSV file
        containing the taxa names.
    mode : {"remove", "inverse"}
        The removal mode (default is remove).

    See Also
    --------
    Alignment.remove_taxa
    """

    # Check if taxa_list is an input csv file
    try:
        file_handle = open(taxa_list[0])
        taxa_list = self.read_basic_csv(file_handle)
    # If not, then the method's argument is already the final list
    except (IOError, IndexError):
        pass

    for alignment_obj in list(self.alignments.values()):
        alignment_obj.remove_taxa(taxa_list, mode=mode)

    # Update taxa names
    if mode == "remove":
        for tx in taxa_list:
            try:
                self.taxa_names.remove(tx)
            except ValueError:
                # TODO: log a warning
                pass
    elif mode == "inverse":
        self.taxa_names = [tx for tx in taxa_list
                           if tx in self.taxa_names]
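The two removal modes reduce to simple list filtering on the taxon names. A standalone sketch (the helper name and taxa are illustrative, not TriFusion's API):

```python
def apply_removal(taxa_names, taxa_list, mode="remove"):
    """Mimic the two removal modes on a plain list of taxon names."""
    if mode == "remove":
        # Keep everything except the specified taxa.
        return [tx for tx in taxa_names if tx not in taxa_list]
    elif mode == "inverse":
        # Keep only the specified taxa that are actually present.
        return [tx for tx in taxa_list if tx in taxa_names]

taxa = ["spA", "spB", "spC"]
print(apply_removal(taxa, ["spB"], mode="remove"))   # ['spA', 'spC']
print(apply_removal(taxa, ["spB"], mode="inverse"))  # ['spB']
```

In "inverse" mode the comprehension iterates over `taxa_list`, so taxa requested but absent from the data are silently dropped rather than added.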
[docs] def change_taxon_name(self, old_name, new_name): """Changes the name of a taxon. """ for alignment_obj in list(self.alignments.values()): alignment_obj.change_taxon_name(old_name, new_name) # update taxa names self.taxa_names = [new_name if x == old_name else x for x in self.taxa_names]
[docs] def remove_file(self, filename_list): """Removes alignments. Removes `Alignment` objects from `alignments` and `shelve_alignments` based on their `name` attribute. Parameters ---------- filename_list : list List of `Alignment.name` to be removed. """ # If filename_list corresponds to all files in the current alignment # list, dispatch the clear_alignments methods if set(self.path_list) - set(filename_list) == set([]): self.clear_alignments() for nm in filename_list: if nm in self.alignments: self.alignments[nm].remove_alignment() del self.alignments[nm] elif nm in self.shelve_alignments: self.shelve_alignments[nm].remove_alignment() del self.shelve_alignments[nm] try: self.partitions.remove_partition(file_name=nm) except PartitionException: pass # Updates taxa names self.taxa_names = self._get_taxa_list()
[docs] def select_by_taxa(self, taxa_list, mode="strict"): """Selects a list of `Alignment` objects by taxa. This method is used to select `Alignments` based on whether they contain or exclude certain taxa. The selected alignments are returned in a list. Parameters ---------- taxa_list : list List of taxa names. mode : str Selection mode. Can be: - "strict": Selects alignments containing all and only the provided taxa. - "inclusive": Selects alignments containing the provided taxa. - "relaxed": Selects alignments if they contain at least one of the provided taxa. Returns ------- selected_alignments : list List of `Alignment` objects. """ selected_alignments = [] # taxa_list may be a file name (string) or a list containing the name # of the taxa. If taxa_list is a file name this code will parse the # csv file and return a list of the taxa. Otherwise, the taxa_list # variable remains the same. try: file_handle = open("".join(taxa_list)) taxa_list = self.read_basic_csv(file_handle) except EnvironmentError: pass for alignment_obj in self.alignments.values(): alignment_taxa = alignment_obj.taxa_list # Select only the alignments with the exact same taxa if mode == "strict": if set(taxa_list) == set(alignment_taxa): selected_alignments.append(alignment_obj) # Select alignments that include the specified taxa if mode == "inclusive": if set(taxa_list) - set(alignment_taxa) == set(): selected_alignments.append(alignment_obj) if mode == "relaxed": for taxon in taxa_list: if taxon in alignment_taxa: selected_alignments.append(alignment_obj) break return selected_alignments
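The three selection modes boil down to set comparisons between the queried taxa and each alignment's taxa. A standalone sketch on hypothetical data:

```python
# Set-based selection logic for the "strict", "inclusive" and
# "relaxed" modes, on toy alignments (names are hypothetical).
taxa_list = ["taxA", "taxB"]
alignments = {
    "aln1": ["taxA", "taxB"],          # exactly the queried taxa
    "aln2": ["taxA", "taxB", "taxC"],  # superset of the queried taxa
    "aln3": ["taxB", "taxD"],          # shares only one queried taxon
}

def select(mode):
    selected = []
    for name, alignment_taxa in alignments.items():
        if mode == "strict" and set(taxa_list) == set(alignment_taxa):
            selected.append(name)
        elif mode == "inclusive" and not set(taxa_list) - set(alignment_taxa):
            selected.append(name)
        elif mode == "relaxed" and set(taxa_list) & set(alignment_taxa):
            selected.append(name)
    return selected
```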
[docs] def code_gaps(self, *args, **kwargs): """Code gaps in each `Alignment` object. This wraps the execution of `code_gaps` method for each `Alignment` object in the `alignments` attribute. Parameters ---------- table_in : string Name of database table containing the alignment data that is used for this operation. table_out : string Name of database table where the final alignment will be inserted (default is "gaps"). use_main_table : bool If True, both `table_in` and `table_out` are ignored and the main table `Alignment.table_name` is used as the input and output table (default is False). ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. pbar : ProgressBar A ProgressBar object used to log the progress of TriSeq execution. See Also -------- Alignment.code_gaps """ ns = kwargs.get("ns", None) pbar = kwargs.pop("pbar", None) self._set_pipes(ns, pbar, total=len(self.alignments)) for p, alignment_obj in enumerate(self.alignments.values()): self._update_pipes(ns, pbar, value=p + 1, msg="Coding file {}".format( basename(alignment_obj.name))) alignment_obj.code_gaps(*args, **kwargs) self._reset_pipes(ns)
[docs] def collapse(self, *args, **kwargs): """Collapses equal sequences for each `Alignment` object. This wraps the execution of `collapse` method for each `Alignment` object in the `alignments` attribute. Parameters ---------- write_haplotypes : bool If True, a file mapping the taxa names to their respective haplotype name will be generated (default is True). haplotypes_file : string Name of the file mapping taxa names to their respective haplotype. Only used when `write_haplotypes` is True. If it is not provided and `write_haplotypes` is True, the file name will be determined from `Alignment.name`. dest : string Path of directory where `haplotypes_file` will be generated (default is "."). conversion_suffix : string Suffix appended to the haplotypes file. Only used when `write_haplotypes` is True and `haplotypes_file` is None. haplotype_name : string Prefix of the haplotype string. The final haplotype name will be `haplotype_name` + <int> (default is "Hap"). table_in : string Name of database table containing the alignment data that is used for this operation. table_out : string Name of database table where the final alignment will be inserted (default is "collapsed"). use_main_table : bool If True, both `table_in` and `table_out` are ignored and the main table `Alignment.table_name` is used as the input and output table (default is False). ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. pbar : ProgressBar A ProgressBar object used to log the progress of TriSeq execution. 
See Also -------- Alignment.collapse """ ns = kwargs.get("ns", None) pbar = kwargs.get("pbar", None) write_haplotypes = kwargs.get("write_haplotypes", None) conversion_suffix = kwargs.get("conversion_suffix", None) haplotypes_file = kwargs.get("haplotypes_file", None) self._set_pipes(ns, pbar, total=len(self.alignments)) for p, alignment_obj in enumerate(self.alignments.values()): self._update_pipes(ns, pbar, value=p + 1, msg="Collapsing file {}".format( basename(alignment_obj.name))) if write_haplotypes: # Set name for haplotypes file output_file = alignment_obj.name.split(".")[0] + \ conversion_suffix + haplotypes_file alignment_obj.collapse(haplotypes_file=output_file, *args, **kwargs) else: alignment_obj.collapse(*args, **kwargs) self._reset_pipes(ns)
[docs] def consensus(self, *args, **kwargs): """Creates consensus sequences for each alignment. This wraps the execution of `consensus` method for each `Alignment` object in the `alignments` attribute. It has an additional argument, `single_file`, which can be used to merge the consensus sequence from each alignment into a new `Alignment` object. Parameters ---------- consensus_type : {"IUPAC", "Soft mask", "Remove", "First sequence"} Type of variation handling. See summary above. single_file : bool If True, the consensus sequence of each `Alignment` object will be merged into a single `Alignment` object. table_in : string Name of database table containing the alignment data that is used for this operation. table_out : string Name of database table where the final alignment will be inserted (default is "consensus"). use_main_table : bool If True, both `table_in` and `table_out` are ignored and the main table `Alignment.table_name` is used as the input and output table (default is False). ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. pbar : ProgressBar A ProgressBar object used to log the progress of TriSeq execution. Returns ------- consensus_aln : trifusion.process.sequence.Alignment Alignment object with all the consensus sequences merged. It's only returned when the `single_file` parameter is set to True. 
See Also -------- Alignment.consensus """ ns = kwargs.get("ns", None) pbar = kwargs.get("pbar", None) single_file = kwargs.pop("single_file", None) self._set_pipes(ns, pbar, len(self.alignments)) self.taxa_names = ["consensus"] # Variables that will store the taxa_list and taxa_idx to provide # to the Alignment object taxa_list = [] taxa_idx = {} if single_file: # Create a table that will harbor the consensus sequences of all # Alignment objects self._create_table("consensus") self.active_tables.append("consensus") consensus_data = [] for p, alignment_obj in enumerate(self.alignments.values()): self._update_pipes(ns, pbar, value=p + 1, msg="Processing file {}".format( basename(alignment_obj.name))) alignment_obj.consensus(*args, **kwargs) if single_file: table_out = kwargs.get("table_out", None) sequence = alignment_obj.get_sequence("consensus", table_suffix=table_out) consensus_data.append((p, alignment_obj.sname, sequence)) taxa_list.append(alignment_obj.sname) taxa_idx[alignment_obj.sname] = p self._reset_pipes(ns) if single_file: use_main_table = kwargs.get("use_main_table", None) if use_main_table: self.remove_tables(preserve_tables=["consensus"]) # Populate database table self.cur.executemany("INSERT INTO [{}] VALUES (?, ?, ?)".format( "consensus"), consensus_data) # Create Alignment object consensus_aln = Alignment("consensus", sql_cursor=self.cur, sql_con=self.con, sequence_code=self.sequence_code, taxa_list=taxa_list, taxa_idx=taxa_idx) return consensus_aln
[docs] def reverse_concatenate(self, ns=None): """Reverse a concatenated file according to the partitions. This is basically a wrapper of the `reverse_concatenate` method of `Alignment` and is rarely useful. It is preferable to retrieve an `Alignment` object, provide some partitions and then perform the reverse concatenation. Since the reverse concatenation operation can only be applied to `Alignment` objects, this method actually creates a concatenated `Alignment`. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- reverted_alns : AlignmentList AlignmentList with the partitions as `Alignment` objects. """ concatenated_aln = self.concatenate(alignment_name="concatenated") reverted_alns = concatenated_aln.reverse_concatenate(db_con=self.con, ns=ns) return reverted_alns
[docs] def write_to_file(self, output_format, conversion_suffix="", output_suffix="", *args, **kwargs): """Writes `Alignment` objects into files. Wrapper of the `write_to_file` method of the `Alignment` object. Parameters ---------- output_format : list List with the output formats to generate. Options are: {"fasta", "phylip", "nexus", "stockholm", "gphocs", "ima2"}. output_file : str Name of the output file. If using `ns_pipe` in TriFusion, it will prompt the user if the output file already exists. tx_space_nex : int Space (in characters) provided for the taxon name in nexus format (default is 40). tx_space_phy : int Space (in characters) provided for the taxon name in phylip format (default is 40). tx_space_ima2 : int Space (in characters) provided for the taxon name in ima2 format (default is 10). cut_space_nex : int Set the maximum allowed character length for taxon names in nexus format. Longer names are truncated (default is 39). cut_space_phy : int Set the maximum allowed character length for taxon names in phylip format. Longer names are truncated (default is 39). cut_space_ima2 : int Set the maximum allowed character length for taxon names in ima2 format. Longer names are truncated (default is 8). interleave : bool Determines whether the output alignment will be in sequential/leave (False) or interleave (True) format. Not all output formats support this option. gap : str Symbol for alignment gaps (default is '-'). model_phylip : str Substitution model for the auxiliary partition file of phylip format, compliant with RAxML. outgroup_list : list Specify list of taxon names that will be defined as the outgroup in the nexus output format. This may be useful for analyses with MrBayes or other software that may require outgroups. ima2_params : list A list with the additional information required for the ima2 output format. The list should contain the following information: 1. (str) path to file containing the species and populations. 2. 
(str) Population tree in newick format, e.g. (0,1):2. 3. (str) Mutational model for all alignments. 4. (str) inheritance scalar. use_charset : bool If True, partitions from the `Partitions` object will be written in the nexus output format (default is True). partition_file : bool If True, the auxiliary partitions file will be written (default is True). output_dir : str If provided, the output file will be written on the specified directory. phy_truncate_names : bool If True, taxon names in phylip format are truncated to a maximum of 10 characters (default is False). ld_hat : bool If True, the fasta output format will include a first line compliant with the format of LD Hat and will truncate sequence names and sequence length per line accordingly. use_nexus_models : bool If True, writes the partitions charset block in nexus format. ns_pipe : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. pbar : ProgressBar A ProgressBar object used to log the progress of TriSeq execution. table_suffix : string Suffix of the table from where the sequence data is fetched. The suffix is appended to `Alignment.table_name`. table_name : string Name of the table from where the sequence data is fetched. 
""" ns_pipe = kwargs.get("ns_pipe", None) pbar = kwargs.pop("pbar", None) self._set_pipes(ns_pipe, pbar, len(self.alignments)) for p, alignment_obj in enumerate(self.alignments.values()): self._update_pipes(ns_pipe, pbar, value=p + 1, msg="Writting file {}".format( basename(alignment_obj.name))) output_file_name = alignment_obj.name.split(".")[0] + \ conversion_suffix + output_suffix kwargs["output_file"] = output_file_name # Get partition name for current alignment object try: part_name = [x for x, y in self.partitions.partitions_alignments.items() if alignment_obj.path in y][0] except IndexError: try: part_name = [x for x, y in self.partitions.partitions_alignments.items() if alignment_obj.name in y][0] except IndexError: part_name = None # Get model from partitions if part_name: m = self.partitions.models[part_name] alignment_obj.partitions.set_model(part_name, m[1]) alignment_obj.write_to_file(output_format, *args, **kwargs) self._reset_pipes(ns_pipe)
[docs] def get_gene_table_stats(self, active_alignments=None, sortby=None, ascending=True): """Returns summary statistics for each `Alignment`. Gets summary statistics for each individual gene for the gene table view of TriFusion. Summary statistics have already been calculated in get_summary_stats, so here we only get the active alignments for showing. Therefore, this function should only be called after get_summary_stats. Parameters ---------- active_alignments : list List of `Alignment.name` objects to show sortby : str Table header used for sorting. Can be {"nsites", "taxa", "var", "inf", "gap", "missing"}. ascending : bool If True, the `sortby` column will be sorted in ascending order (default is True). Returns ------- summary_gene_table : pandas.DataFrame DataFrame with summary statistics for each alignment. table : list List with the data for creating .csv tables. """ # Set table header table = [["Gene name", "Number of sites", "Number of taxa", "Variable sites", "Informative sites", "Gaps", "Missing data"]] # Table line keys matching summary_gene_table dict values tl = ["nsites", "taxa", "var", "inf", "gap", "missing"] # Update the active alignments if active_alignments and active_alignments != list( self.alignments.keys()): self.update_active_alignments(active_alignments) # Filter table with active alignments summary_gene_table = self.summary_gene_table[ self.summary_gene_table["genes"].isin(self.aln_names())] if not sortby: sortby = ["genes"] summary_gene_table.sort_values(sortby, ascending=ascending, inplace=True) # Populate table information from the filtered and sorted DataFrame for k in summary_gene_table.itertuples(): # Add table line table.append(list(k[1:])) return summary_gene_table, table
# Stats methods
[docs] def get_summary_stats(self, active_alignments=None, ns=None): """Calculates summary statistics for the 'active' alignments. Creates/Updates summary statistics for the active alignments. Parameters ---------- active_alignments : list List of `Alignment` objects for which the summary statistics will be calculated ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- summary_stats : dict Dictionary with the overall summary statistics table : list List with overall summary statistics for creating .csv tables. """ # Update active alignments if they changed since last update if active_alignments and \ active_alignments != list(self.alignments.keys()): self.update_active_alignments(active_alignments) if ns: ns.files = len(self.alignments) # Set table header for summary_stats table = [["Genes", "Taxa", "Alignment length", "Gaps", "Gaps per gene", "Missing data", "Missing data per gene", "Variable sites", "Variable sites per gene", "Informative sites", "Informative sites per gene"]] # Table line keys matching summary_stats for table completion tl = ["genes", "taxa", "seq_len", "gaps", "avg_gaps", "missing", "avg_missing", "variable", "avg_var", "informative", "avg_inf"] # Reset summary_stats self._reset_summary_stats() # Get number of alignments self.summary_stats["genes"] = len(self.alignments) # Get number of taxa self.summary_stats["taxa"] = len(self.taxa_names) # Get statistics that require iteration over alignments for p, (_, aln) in enumerate(sorted(self.alignments.items())): if ns: if ns.stop: raise KillByUser("Child thread killed by user") ns.counter += 1 # Get alignment size info self.summary_stats["seq_len"] += aln.locus_length cur_gap, cur_missing = 0, 0 cur_var, cur_inf = 0, 0 for col in aln.iter_columns(): if ns: if ns.stop: raise KillByUser("") col = Counter(col) # Get missing data and gaps if self.sequence_code[1] in col: self.summary_stats["missing"] += 1 cur_missing += 1 if 
self.gap_symbol in col: self.summary_stats["gaps"] += 1 cur_gap += 1 # Get variability information # Filter missing data for i in [self.sequence_code[1], self.gap_symbol]: del col[i] # If it's only missing data, ignore if col: # Get variable sites if len(col) > 1: self.summary_stats["variable"] += 1 cur_var += 1 # If any of the remaining sites is present in more than two # taxa score the site as informative if len([x for x in col.values() if x >= 2]) >= 2: self.summary_stats["informative"] += 1 cur_inf += 1 # Get values for current alignment for average calculations self.summary_stats["avg_gaps"].append(cur_gap) self.summary_stats["avg_missing"].append(cur_missing) self.summary_stats["avg_var"].append(cur_var) self.summary_stats["avg_inf"].append(cur_inf) if not any(self.summary_gene_table["genes"] == aln.name): # Get row information for current gene in dict format row_info = (aln.name, aln.locus_length, len(aln.taxa_list), cur_var, cur_inf, cur_gap, cur_missing) # Create DataFrame from dict row_dt = pd.DataFrame([row_info], index=[p], columns=self.summary_gene_table.columns) # Add row DF to main DF self.summary_gene_table = pd.concat([self.summary_gene_table, row_dt]) # Get average values for k in ["avg_gaps", "avg_missing", "avg_var", "avg_inf"]: self.summary_stats[k] = round(np.mean(self.summary_stats[k])) # Get percentage values for missing data total = self.summary_stats["seq_len"] * self.summary_stats["taxa"] for k in ["gaps", "missing"]: n = self.summary_stats[k] percentage = round((float(n) / float(total)) * 100, 2) self.summary_stats[k] = "{0:,} ({1}%)".format(n, percentage) # Get percentage values for variation for k in ["variable", "informative"]: n = self.summary_stats[k] percentage = round((float(n) / float(self.summary_stats["seq_len"])) * 100, 2) self.summary_stats[k] = "{0:,} ({1}%)".format(n, percentage) # Complete table information table.append([self.summary_stats[x] for x in tl]) return dict(self.summary_stats), table
@check_data
[docs] def gene_occupancy(self, ns=None): """Create data for gene occupancy plot. Creates data for an interpolation plot to visualize the amount of missing genes in an alignment dataset. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "ax_names": 2 element list with axis labels [x, y], "title": str with title """ data = [] for alignment in self.alignments.values(): if ns: if ns.stop: raise KillByUser() data.append([1 if x in alignment.taxa_list else 0 for x in self.taxa_names]) data = np.transpose(data) return {"data": data, "ax_names": ["Genes", "Taxa"], "title": "Gene occupancy"}
@check_data
[docs] def missing_data_distribution(self, ns=None): """Create data for overall distribution of missing data plot. Creates data for overall distribution of missing data. This will calculate the amount of gaps, missing and actual data. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "ax_names": 2 element list with axis labels [x, y], "title": str with title, "legend": mpl Legend object, "table_header": list with headers of table. """ if ns: ns.files = len(self.alignments) legend = ["Gaps", "Missing data", "Data"] data_storage = OrderedDict((x, []) for x in legend) for aln in self.alignments.values(): if ns: # Kill switch to interrupt worker if ns.stop: raise KillByUser("") # Add to progress counter ns.counter += 1 gaps_g, missing_g, data_g = [], [], [] for tx in self.taxa_names: if tx in aln.taxa_list: seq = aln.get_sequence(tx) # Get gaps gaps = float(seq.count("-")) gaps_g.append(gaps / float(aln.locus_length)) # Get missing data missing = float(seq.count(aln.sequence_code[1])) missing_g.append(missing / float(aln.locus_length)) # Get actual data actual_data = (float(aln.locus_length) - gaps - missing) / \ float(aln.locus_length) data_g.append(actual_data) else: gaps_g.append(0) missing_g.append(1) data_g.append(0) data_storage["Gaps"].append(np.mean(gaps_g)) data_storage["Missing data"].append(np.mean(missing_g)) data_storage["Data"].append(np.mean(data_g)) data = [x for x in data_storage.values()] return {"data": data, "title": "Distribution of missing data", "legend": legend, "ax_names": ["Proportion", "Number of genes"], "table_header": ["Bin"] + legend}
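The per-sequence bookkeeping inside the loop reduces to three counts over a sequence string. A standalone sketch on a toy sequence, assuming "n" as the DNA missing-data symbol:

```python
# Proportions of gaps, missing data and actual data for one sequence,
# as computed inside the loop above. The sequence is a hypothetical
# DNA example.
seq = "acgt--nnacgt"
locus_length = len(seq)
missing_char = "n"  # aln.sequence_code[1] for DNA data

gaps = float(seq.count("-"))
missing = float(seq.count(missing_char))
gap_prop = gaps / locus_length
missing_prop = missing / locus_length
data_prop = (locus_length - gaps - missing) / locus_length
```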
@check_data
[docs] def missing_data_per_species(self, ns=None): """Creates data for missing data proportion per species plot. For each taxon in `taxa_names` calculate the proportion of gap, missing and actual data. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "ax_names": 2 element list with axis labels [x, y], "title": str with title, "legend": mpl Legend object, "table_header": list with headers of table, "normalize": bool, whether data values should be normalized, "normalize_factor": int, factor to use in normalization """ # Data for a stacked bar plot. First element for gaps, second for # missing, third for actual data data_storage = OrderedDict((taxon, [0, 0, 0]) for taxon in self.taxa_names) total_len = 0 legend = ["Gaps", "Missing", "Data"] if ns: ns.files = len(self.alignments) for aln in self.alignments.values(): if ns: # Kill switch to interrupt worker if ns.stop: raise KillByUser("") # Add to progress counter ns.counter += 1 total_len += aln.locus_length for taxon in data_storage: if taxon in aln.taxa_list: # Get gaps seq = aln.get_sequence(taxon) gaps = seq.count("-") data_storage[taxon][0] += gaps # Get missing missing = seq.count(aln.sequence_code[1]) data_storage[taxon][1] += missing # Get actual data actual_data = aln.locus_length - gaps - missing data_storage[taxon][2] += actual_data else: data_storage[taxon][1] += aln.locus_length data_storage = OrderedDict(sorted(data_storage.items(), key=lambda x: x[1][1] + x[1][0], reverse=True)) data = np.array([[float(x[0]) for x in data_storage.values()], [float(x[1]) for x in data_storage.values()], [float(x[2]) for x in data_storage.values()]]) return {"data": data, "title": "Distribution of missing data per species", "labels": list(data_storage.keys()), "legend": legend, "ax_names": [None, "Frequency"], "table_header": ["Taxon", "Gaps", "%", "Missing", "%", 
"Data", "%"], "normalize": True, "normalize_factor": total_len}
@check_data
[docs] def missing_genes_per_species(self, ns=None): """Creates data with distribution of missing genes per species plot. Calculates, for each taxon, the number of alignments where it is present. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "labels": list, label for xticks, "ax_names": 2 element list with axis labels [x, y], "title": str with title, "table_header": list with headers of table """ data_storage = OrderedDict((taxon, 0) for taxon in self.taxa_names) if ns: ns.files = len(self.alignments) for aln in self.alignments.values(): if ns: # Kill switch to interrupt worker if ns.stop: raise KillByUser("") # Add to progress counter ns.counter += 1 for key in data_storage: if key not in aln.taxa_list: data_storage[key] += 1 # Sort data in descending order of missing genes data_storage = OrderedDict(sorted(data_storage.items(), reverse=True, key=lambda t: t[1])) return {"data": [list(data_storage.values())], "labels": list(data_storage.keys()), "title": "Distribution of missing genes per species", "ax_names": [None, "Frequency"], "table_header": ["Taxon", "Missing genes"] }
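The counting logic in the loop above, sketched on hypothetical data:

```python
# For each taxon, count in how many alignments it is absent.
# Taxa and alignment contents are toy examples.
taxa_names = ["taxA", "taxB", "taxC"]
alignment_taxa = [
    ["taxA", "taxB"],          # taxC missing
    ["taxA", "taxB", "taxC"],  # none missing
    ["taxA"],                  # taxB and taxC missing
]

missing_counts = {tx: 0 for tx in taxa_names}
for taxa in alignment_taxa:
    for tx in missing_counts:
        if tx not in taxa:
            missing_counts[tx] += 1
```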
@check_data
[docs] def missing_genes_average(self, ns=None): """Creates histogram data with the distribution of missing taxa per alignment. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "ax_names": 2 element list with axis labels [x, y], "title": str with title, "table_header": list with headers of table """ data = [] if ns: ns.files = len(self.alignments) for aln in self.alignments.values(): if ns: # Kill switch if ns.stop: raise KillByUser("") ns.counter += 1 data.append(len(set(self.taxa_names) - set(aln.taxa_list))) return {"data": data, "title": "Distribution of missing taxa", "ax_names": ["Number of missing taxa", "Frequency"], "table_header": ["Number of missing taxa", "Frequency"]}
@check_data
[docs] def average_seqsize_per_species(self, ns=None): """Creates data for average sequence size per taxon. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "labels": list, label for xticks, "ax_names": 2 element list with axis labels [x, y], "title": str with title """ data_storage = OrderedDict((taxon, []) for taxon in self.taxa_names) if ns: ns.files = len(self.alignments) for aln in self.alignments.values(): if ns: # Kill switch if ns.stop: raise KillByUser("") ns.counter += 1 for sp, seq in aln: data_storage[sp].append(len(seq.replace("-", ""). replace(aln.sequence_code[1], ""))) # Adapt y-axis label according to sequence code seq_code = aln.sequence_code[0] ax_ylabel = "Size (bp)" if seq_code == "DNA" else "Size (residues)" data_storage = OrderedDict(sorted(data_storage.items(), reverse=True, key=lambda t: np.mean(t[1]))) return {"data": list(data_storage.values()), "labels": list(data_storage.keys()), "title": "Sequence size distribution per species", "ax_names": [None, ax_ylabel]}
@check_data
[docs] def average_seqsize(self, ns=None): """Creates data for the average sequence size for the entire data set. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "ax_names": 2 element list with axis labels [x, y], "title": str with title, "table_header": list with headers of table """ data_storage = [] if ns: ns.files = len(self.alignments) for aln in self.alignments.values(): if ns: # Kill switch if ns.stop: raise KillByUser("") ns.counter += 1 data_storage.append(aln.locus_length) # Adapt y-axis label according to sequence code seq_code = aln.sequence_code[0] ax_xlabel = "Size (bp)" if seq_code == "DNA" else "Size (residues)" return {"data": data_storage, "title": "Average sequence size distribution", "ax_names": [ax_xlabel, "Frequency"], "table_header": [ax_xlabel, "Frequency"]}
@check_data
[docs] def characters_proportion(self, ns=None): """Creates data for the proportion of nt/aa for the data set. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "labels": list, label for xticks, "ax_names": 2 element list with axis labels [x, y], "title": str with title, "table_header": list with headers of table, """ data_storage = Counter() if ns: ns.files = len(self.alignments) for aln in self.alignments.values(): if ns: # Kill switch if ns.stop: raise KillByUser("") ns.counter += 1 for seq in aln.iter_sequences(): data_storage += Counter(seq.replace("-", ""). replace(self.sequence_code[1], "")) # Determine total number of characters chars = float(sum(data_storage.values())) # Valid characters list valid_chars = dna_chars if self.sequence_code[0] == "DNA" else \ list(aminoacid_table.keys()) data, xlabels = zip(*[(float(x) / chars, y.upper()) for y, x in data_storage.items() if y in valid_chars]) title = "Nucleotide proportions" if self.sequence_code[0] == "DNA" \ else "Amino acid proportions" ax_xlabel = "Nucleotide" if self.sequence_code[0] == "DNA" \ else "Amino acid" return {"data": [data], "labels": xlabels, "title": title, "ax_names": [ax_xlabel, "Proportion"], "table_header": [ax_xlabel, "Proportion"]}
@check_data
[docs] def characters_proportion_per_species(self, ns=None): """Creates data for the proportion of nt/aa per species. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "labels": list, label for xticks, "ax_names": 2 element list with axis labels [x, y], "legend": mpl Legend object, "title": str with title, "table_header": list with headers of table """ if ns: ns.files = len(self.alignments) data_storage = OrderedDict((x, Counter()) for x in self.taxa_names) for aln in self.alignments.values(): if ns: # Kill switch if ns.stop: raise KillByUser("") ns.counter += 1 for sp, seq in aln: data_storage[sp] += Counter(seq.replace("-", ""). replace(self.sequence_code[1], "")) legend = dna_chars if self.sequence_code[0] == "DNA" else \ list(aminoacid_table.keys()) data = [[] for _ in legend] for p, char in enumerate(legend): for c in data_storage.values(): chars = float(sum([x for y, x in c.items() if y in legend])) data[p].append(float(c[char]) / chars) data = np.array(data) title = "Nucleotide proportions" if self.sequence_code[0] == "DNA" \ else "Amino acid proportions" ax_ylabel = "Nucleotide" if self.sequence_code[0] == "DNA" \ else "Amino acid" return {"data": data, "title": title, "labels": list(data_storage.keys()), "legend": legend, "ax_names": ["Taxa", ax_ylabel], "table_header": ["Taxon"] + legend}
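The per-taxon character tally above is just a `Counter` over the sequence with gaps and missing data stripped out. A minimal sketch on a toy DNA sequence, assuming "n" as the missing-data symbol:

```python
from collections import Counter

# Character proportions for one sequence, as in the loops above.
# The sequence is a hypothetical DNA example.
seq = "acgtacgt--nn"
counts = Counter(seq.replace("-", "").replace("n", ""))
total = float(sum(counts.values()))
proportions = {char: counts[char] / total for char in counts}
```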
@LookupDatabase def _get_similarity(self, seq1, seq2, aln_len): """Gets the similarity between two strings and the effective length. Compares two sequences and calculates the number of similarities. This comparison ignores columns with missing data, so the final sequence length that is actually compared has to be adjusted. Parameters ---------- seq1, seq2 : str Sequence strings. aln_len : int Original length of `seq1` and `seq2`. Returns ------- sim : float Number of pairwise similarities. ef_len : float Effective sequence length, without missing data. Notes ----- To avoid performing the calculations between identical string pairs, this method is decorated with a class that stores the calculation result of new string pairs in a database, and fetches that result when the same string pair appears later on. """ seq1 = np.array(list(seq1)) seq2 = np.array(list(seq2)) ef_len_ = (seq1 != self.sequence_code[1]) & \ (seq1 != self.gap_symbol) & \ (seq2 != self.sequence_code[1]) & \ (seq2 != self.gap_symbol) sim_ = (seq1 == seq2) & ef_len_ sim = np.count_nonzero(sim_) ef_len = np.count_nonzero(ef_len_) return float(sim), float(ef_len)
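The vectorized comparison above can be exercised on its own. A standalone sketch with toy sequences, assuming "n" as the missing-data symbol and "-" as the gap symbol:

```python
import numpy as np

def pairwise_similarity(seq1, seq2, missing="n", gap="-"):
    """Count matches and the effective (non-missing, non-gap) length,
    mirroring the boolean-mask logic of _get_similarity."""
    a, b = np.array(list(seq1)), np.array(list(seq2))
    # Columns where neither sequence has a gap or missing data
    effective = (a != missing) & (a != gap) & (b != missing) & (b != gap)
    # Matches are only counted within the effective columns
    matches = (a == b) & effective
    return float(np.count_nonzero(matches)), float(np.count_nonzero(effective))
```

The similarity proportion reported by the plotting methods is then simply `matches / effective` whenever the effective length is non-zero.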
def _get_informative_sites(self, aln): """Calculates the number of informative sites in an `Alignment`. Parameters ---------- aln : trifusion.process.sequence.Alignment `Alignment` object. Returns ------- informative_sites : int Number of informative sites for `Alignment` object. """ informative_sites = 0 for column in aln.iter_columns(): column = Counter([x for x in column if x != aln.sequence_code[1] and x != self.gap_symbol]) if len(column) > 1: # Remove most common and check the length of the remaining del column[column.most_common()[0][0]] if sum(column.values()) > 1: informative_sites += 1 return informative_sites
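The site-counting logic above can be sketched standalone. This mirrors the method's criterion (a column is counted when, after removing gaps, missing data, and the most common state, more than one character remains); the function name and the `"n"`/`"-"` symbols are assumptions for illustration:

```python
from collections import Counter

def count_informative_sites(columns, missing="n", gap="-"):
    """Count columns that remain variable after discarding the most
    common state (mirrors `_get_informative_sites`)."""
    informative = 0
    for column in columns:
        counts = Counter(c for c in column if c not in (missing, gap))
        if len(counts) > 1:
            # Remove the most common state and check what remains
            del counts[counts.most_common()[0][0]]
            if sum(counts.values()) > 1:
                informative += 1
    return informative
```

Columns can be produced from a list of sequences with `zip(*seqs)`, analogous to `iter_columns`.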
@check_data
def sequence_similarity(self, ns=None): """Creates data for average sequence similarity plot. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "ax_names": 2 element list with axis labels [x, y] """ self._get_similarity("connect") data = [] if ns: ns.files = len(self.alignments) for aln in self.alignments.values(): if ns: ns.counter += 1 aln_similarities = [] for seq1, seq2 in itertools.combinations(aln.iter_sequences(), 2): if ns: if ns.stop: raise KillByUser("") sim, total_len = self._get_similarity(seq1, seq2, aln.locus_length) if total_len: aln_similarities.append(sim / total_len) if aln_similarities: data.append(np.mean(aln_similarities) * 100) self._get_similarity("disconnect") return {"data": data, "ax_names": ["Similarity (%)", "Frequency"]}
@check_data
def sequence_similarity_per_species(self, ns=None): """Creates data for sequence similarity per species pair plot. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "labels": list, label for xticks, "color_label": str, label for colorbar """ if ns: ns.files = len(self.alignments) self._get_similarity("connect") # Create matrix for pairwise comparisons data = [np.empty((len(self.taxa_names), 0)).tolist() for _ in range(len(self.taxa_names))] taxa_pos = OrderedDict((x, y) for y, x in enumerate(self.taxa_names)) for aln in self.alignments.values(): if ns: ns.counter += 1 for tx1, tx2 in itertools.combinations(taxa_pos.keys(), 2): if ns: # Kill switch if ns.stop: raise KillByUser("") try: seq1, seq2 = aln.get_sequence(tx1), aln.get_sequence(tx2) except KeyError: continue sim, l = self._get_similarity(seq1, seq2, aln.locus_length) if l: data[taxa_pos[tx1]][taxa_pos[tx2]].append(sim / l) data = np.array([[np.mean(y) if y else 0. for y in x] for x in data]) mask = np.tri(data.shape[0], k=0) data = np.ma.array(data, mask=mask) self._get_similarity("disconnect") return {"data": data, "color_label": "Pairwise sequence similarity", "labels": list(taxa_pos)}
@check_data
def sequence_similarity_gene(self, gene_name, window_size, ns=None): """Creates data for sliding window sequence similarity for alignment. Retrieves an alignment using `gene_name` and calculates the average pair-wise sequence similarity along the alignment length using a sliding window approach. Parameters ---------- gene_name : str `Alignment.name` from an alignment window_size : int Size of sliding window. ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "window_size": int, size of sliding window, "ax_names": 2 element list with axis labels [x, y], "title": str with title, "table_header": list with headers of table """ aln_obj = self.retrieve_alignment(gene_name) data = [] for i in range(0, aln_obj.locus_length, window_size): window_similarities = [] seqs = np.array([[y for y in x[i:i + window_size]] for x in aln_obj.iter_sequences()]) for seq1, seq2 in itertools.combinations(seqs, 2): s, t = self._get_similarity("".join(seq1), "".join(seq2), window_size) if t: sim = s / t else: sim = 0 window_similarities.append(sim * 100) if window_similarities: data.append(np.mean(window_similarities)) return {"data": data, "title": "Sequence similarity sliding window for gene\n %s" % basename(gene_name), "window_size": window_size, "ax_names": ["Sequence (bp)", "Similarity (%)"], "table_header": ["Sequence (bp)", "Similarity (%)"]}
@check_data
def sequence_segregation(self, ns=None, proportions=False): """Creates data for overall distribution of segregating sites. Parameters ---------- proportions : bool If True, use proportions instead of absolute values ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "ax_names": 2 element list with axis labels [x, y], "title": str with title, "table_header": list with headers of table, "real_bin_num": bool, whether the bins are integers """ if ns: ns.files = len(self.alignments) data = [] for aln in self.alignments.values(): segregating_sites = 0 if ns: # Kill switch if ns.stop: raise KillByUser("") ns.counter += 1 for column in aln.iter_columns(): # Remove gaps and missing characters column = set([x for x in column if x != aln.sequence_code[1] and x != self.gap_symbol]) if len(column) > 1: segregating_sites += 1 if proportions: data.append( float(segregating_sites / float(aln.locus_length))) ax_names = ["Segregating sites", "Percentage"] real_bin = False else: data.append(segregating_sites) ax_names = ["Segregating sites", "Frequency"] real_bin = True return {"data": data, "ax_names": ax_names, "title": "Distribution of segregating sites", "table_header": ax_names, "real_bin_num": real_bin}
@check_data
def sequence_segregation_per_species(self, ns=None): """Creates data for a triangular matrix of sequence segregation for pairs of taxa. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "labels": list, label for xticks, "color_label": str, label for colorbar """ if ns: ns.files = len(self.alignments) self._get_similarity("connect") # Create matrix for pairwise comparisons data = [np.empty((len(self.taxa_names), 0)).tolist() for _ in range(len(self.taxa_names))] taxa_pos = OrderedDict((x, y) for y, x in enumerate(self.taxa_names)) for aln in self.alignments.values(): if ns: ns.counter += 1 for tx1, tx2 in itertools.combinations(taxa_pos.keys(), 2): if ns: if ns.stop: raise KillByUser("") try: seq1, seq2 = aln.get_sequence(tx1), aln.get_sequence(tx2) except KeyError: continue s, t = self._get_similarity(seq1, seq2, aln.locus_length) if t: aln_diff = t - s else: aln_diff = 0 data[taxa_pos[tx1]][taxa_pos[tx2]].append(aln_diff) data = np.array([[np.mean(y) if y else 0. for y in x] for x in data]) mask = np.tri(data.shape[0], k=0) data = np.ma.array(data, mask=mask) self._get_similarity("disconnect") return {"data": data, "labels": list(taxa_pos), "color_label": "Segregating sites"}
@check_data
def sequence_segregation_gene(self, gene_name, window_size, ns=None): """Creates data for a sliding window analysis of segregating sites. Retrieves an alignment using `gene_name` and calculates the number of segregating sites along the alignment length using a sliding window approach. Parameters ---------- gene_name : str `Alignment.name` from an alignment window_size : int Size of sliding window. ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "window_size": int, size of sliding window, "ax_names": 2 element list with axis labels [x, y], "title": str with title, "table_header": list with headers of table """ aln_obj = self.retrieve_alignment(gene_name) data = [] for i in range(0, aln_obj.locus_length, window_size): segregating_sites = 0 seqs = np.array([[y for y in x[i:i + window_size]] for x in aln_obj.iter_sequences()]) for column in zip(*seqs): column = set([x for x in column if x != aln_obj.sequence_code[1] and x != "-"]) if len(column) > 1: segregating_sites += 1 data.append(segregating_sites) return {"data": data, "title": "Number of segregating sites sliding window for " "gene\n %s" % basename(gene_name), "window_size": window_size, "ax_names": ["Sequence (bp)", "Segregating sites"], "table_header": ["Sequence (bp)", "Segregating sites"]}
@check_data
def length_polymorphism_correlation(self, ns=None): """Creates data for correlation between alignment length and informative sites. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "ax_names": 2 element list with axis labels [x, y], "correlation": bool, whether to perform spearman correlation, "title": str with title, "table_header": list with headers of table """ if ns: ns.files = len(self.alignments) data_length = [] data_inf = [] for aln in self.alignments.values(): if ns: # Kill switch if ns.stop: raise KillByUser("") ns.counter += 1 # Get informative sites for alignment inf_sites = self._get_informative_sites(aln) data_inf.append(inf_sites) # Get size data_length.append(aln.locus_length) return {"data": [data_length, data_inf], "title": "Correlation between alignment length and number of " "variable sites", "ax_names": ["Alignment length", "Informative sites"], "table_header": ["Alignment length", "Informative sites"], "correlation": True}
@check_data
def allele_frequency_spectrum(self, ns=None, proportions=False): """Creates data for the allele frequency spectrum. Calculates the allele frequency spectrum of the entire alignment data set. Here, multiple alignments are effectively treated as a single one. This method is exclusive to the DNA sequence type and supports IUPAC ambiguity codes. Parameters ---------- proportions : bool If True, use proportions instead of absolute values ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "ax_names": 2 element list with axis labels [x, y], "title": str with title, "table_header": list with headers of table, "real_bin_num": bool, whether the bins are integers """ # Make check for sequence type consistency here for TriStats.py. In # TriFusion the check is made before calling this method if self.sequence_code[0] != "DNA": return {"exception": InvalidSequenceType} data = [] if ns: ns.files = len(self.alignments) for aln in self.alignments.values(): if ns: if ns.stop: raise KillByUser("") ns.counter += 1 for column in aln.iter_columns(): col = [iupac_conv[x] for x in column if x != aln.sequence_code[1] and x != self.gap_symbol] col_len = len(col) # Remove gaps and missing characters column = Counter(col) # Consider only bi-allelic SNPs if len(column) == 2: # Remove most common and check the length of the remaining del column[column.most_common()[0][0]] # Append number of derived alleles if proportions: data.append(float(sum(column.values())) / float(col_len)) else: data.append(sum(column.values())) return {"data": data, "title": "Allele frequency spectrum", "ax_names": ["Derived allele frequency", "Frequency"], "table_header": ["Derived allele frequency", "Frequency"], "real_bin_num": True}
@check_data
def allele_frequency_spectrum_gene(self, gene_name): """Creates data for the allele frequency spectrum of an `Alignment`. Parameters ---------- gene_name : str `Alignment.name` of an alignment. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "ax_names": 2 element list with axis labels [x, y], "title": str with title, "table_header": list with headers of table, "real_bin_num": bool, whether the bins are integers """ # Make check for sequence type consistency here for TriStats.py. In # TriFusion the check is made before calling this method if self.sequence_code[0] != "DNA": return {"exception": InvalidSequenceType} aln = self.retrieve_alignment(gene_name) data = [] for column in aln.iter_columns(): # Remove gaps and missing characters column = Counter([iupac_conv[x] for x in column if x != aln.sequence_code[1] and x != self.gap_symbol]) # Consider only bi-allelic SNPs if len(column) == 2: # Remove most common and check the length of the remaining del column[column.most_common()[0][0]] # Append number of derived alleles data.append(sum(column.values())) return {"data": data, "title": "Allele frequency spectrum", "ax_names": ["Derived allele frequency", "Frequency"], "table_header": ["Derived allele frequency", "Frequency"], "real_bin_num": True}
@check_data
def taxa_distribution(self, ns=None): """Creates data for a distribution of taxa frequency across alignments. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "ax_names": 2 element list with axis labels [x, y], "title": str with title, "table_header": list with headers of table, "real_bin_num": bool, whether the bins are integers """ if ns: ns.files = len(self.alignments) data = [] for aln in self.alignments.values(): if ns: if ns.stop: raise KillByUser("") ns.counter += 1 # Get number of taxa data.append(len(aln.taxa_list)) return {"data": data, "title": "Distribution of taxa frequency", "ax_names": ["Number of taxa", "Frequency"], "table_header": ["Number of taxa", "Frequency"], "real_bin_num": True}
@check_data
def cumulative_missing_genes(self, ns=None): """Creates data for a distribution of the maximum number of genes available for consecutive thresholds of missing data. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "labels": list, label for xticks, "ax_names": 2 element list with axis labels [x, y], "title": str with title, "table_header": list with headers of table """ size_storage = [] data = [] if ns: ns.files = len(self.alignments) # total number of taxa in data set taxa = float(len(self.taxa_names)) for aln in self.alignments.values(): if ns: # Kill switch if ns.stop: raise KillByUser("") ns.counter += 1 # Get number of taxa size_storage.append((float(len(aln.taxa_list)) / taxa) * 100) labels = [] for i in range(0, 105, 5): # Get percentage data.append(len([x for x in size_storage if x > i])) labels.append(str(i)) return {"data": [data], "title": "Gene frequency for decreasing values of missing " "data", "labels": labels, "ax_names": ["Minimum taxa representation", "Gene frequency"], "table_header": ["Percentage", "Frequency"]}
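The thresholding step above is a small cumulative count. A minimal sketch (the function name is hypothetical; `representation` holds per-gene percentages of taxa present, as computed in the loop above):

```python
def gene_frequency_by_threshold(representation, step=5):
    """For each minimum-taxa-representation threshold (0, 5, ..., 100 %),
    count how many genes exceed it."""
    labels, counts = [], []
    for i in range(0, 105, step):
        counts.append(len([x for x in representation if x > i]))
        labels.append(str(i))
    return labels, counts
```

As the threshold grows, the counts are non-increasing: a gene that exceeds a high threshold also exceeds every lower one.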
@staticmethod
def _mad_based_outlier(p, threshold=3.5): """Calculate MAD based outliers. An outlier detection method based on median absolute deviation. This code was adapted from http://stackoverflow.com/a/22357811/1990165. The usage of the median is much less biased than the mean and is robust to smaller data sets. Parameters ---------- p : numpy.array Array with data observations. threshold : float Modified Z-score to use as a threshold. Returns ------- _ : numpy.array Boolean array flagging outlier observations. """ if len(p.shape) == 1: p = p[:, None] # Get sample median. median = np.median(p, axis=0) diff = np.sum((p - median) ** 2, axis=-1) diff = np.sqrt(diff) med_abs_deviation = np.median(diff) # The 0.6745 is a constant that makes values roughly equivalent in # units to standard deviations. z_score = 0.6745 * diff / med_abs_deviation return z_score > threshold
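For 1-D data, the computation above collapses to the modified Z-score formula. A standalone sketch (the function name is hypothetical):

```python
import numpy as np

def mad_outliers(points, threshold=3.5):
    """Flag outliers by modified Z-score: 0.6745 * |x - median| / MAD
    (mirrors the 1-D behaviour of `_mad_based_outlier`)."""
    p = np.asarray(points, dtype=float)
    median = np.median(p)
    abs_dev = np.abs(p - median)
    # Median absolute deviation of the sample
    mad = np.median(abs_dev)
    modified_z = 0.6745 * abs_dev / mad
    return modified_z > threshold
```

Note the method divides by the MAD, so it is undefined for samples where the MAD is zero (more than half the observations identical).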
@check_data
def outlier_missing_data(self, ns=None): """Create data for missing data alignment outliers. Get data for outlier detection of genes based on the distribution of average missing data per gene. Data points will be based on the proportion of missing data symbols out of the possible total. For example, in an alignment with three taxa, each with 100 sites, the total possible missing data is 300 (100 * 3). Here, missing data will be gathered from all taxa and a proportion will be calculated based on the total possible. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "ax_names": 2 element list with axis labels [x, y], "title": str with title, "outliers": list of outlier data points, "outliers_labels": list of outlier labels """ if ns: ns.files = len(self.alignments) data_labels = [] data_points = [] for gn, aln in self.alignments.items(): if ns: if ns.stop: raise KillByUser("") ns.counter += 1 total_len = aln.locus_length * len(aln.taxa_list) gn_data = 0 for seq in aln.iter_sequences(): gn_data += seq.count(self.sequence_code[1]) + \ seq.count(self.gap_symbol) m_data = float(gn_data) / float(total_len) data_points.append(m_data) data_labels.append(gn) data_points = np.asarray(data_points) data_labels = np.asarray(data_labels) # Get outliers outliers_points = data_points[self._mad_based_outlier(data_points)] # Get outlier labels outliers_labels = list(data_labels[self._mad_based_outlier( data_points)]) return {"data": data_points, "title": "Missing data outlier gene detection", "outliers": outliers_points, "outliers_labels": outliers_labels, "ax_names": ["Proportion of missing data", "Frequency"]}
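The per-gene data point described above (three taxa of 100 sites give 300 possible missing characters) reduces to a one-liner. A minimal sketch, assuming `"n"` and `"-"` as missing and gap symbols and a hypothetical function name:

```python
def missing_proportion(seqs, missing="n", gap="-"):
    """Proportion of missing-data and gap symbols out of the total
    number of characters across all sequences of one alignment."""
    total = sum(len(s) for s in seqs)
    missing_chars = sum(s.count(missing) + s.count(gap) for s in seqs)
    return float(missing_chars) / float(total)
```

For example, two sequences of length 4 containing one gap and one missing character give 2 / 8 = 0.25.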
@check_data
def outlier_missing_data_sp(self, ns=None): """Create data for missing data taxa outliers. Gets data for outlier detection of species based on missing data. For this analysis, genes for which a taxon is completely absent will be ignored for the calculations of that taxon. The reason for this is that including genes where the taxon is absent would bias the outlier detection towards taxa that have low prevalence in the data set, even if they have low missing data in the alignments where they are present. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "ax_names": 2 element list with axis labels [x, y], "title": str with title, "outliers": list of outlier data points, "outliers_labels": list of outlier labels """ if ns: ns.files = len(self.alignments) data = dict((tx, []) for tx in self.taxa_names) for aln in self.alignments.values(): if ns: if ns.stop: raise KillByUser("") ns.counter += 1 total_len = aln.locus_length # Get missing data for every taxon for tx in data: if tx in aln.taxa_list: seq = aln.get_sequence(tx) m_data = float(seq.count(self.sequence_code[1]) + seq.count(self.gap_symbol)) / \ float(total_len) data[tx].append(m_data) # Get average for each taxon for tx, vals in data.items(): data[tx] = np.mean(vals) # Prepare data for plotting data_points = [] data_labels = [] for tx, vals in data.items(): data_points.append(vals) data_labels.append(tx) data_points = np.asarray(data_points) data_labels = np.asarray(data_labels) # Get outliers outliers_points = data_points[self._mad_based_outlier(data_points)] # Get outlier taxa outlier_labels = list(data_labels[self._mad_based_outlier(data_points)]) return {"data": data_points, "title": "Missing data outlier taxa detection", "outliers": outliers_points, "outliers_labels": outlier_labels, "ax_names": ["Proportion of missing data", "Frequency"]}
@check_data
def outlier_segregating(self, ns=None): """Creates data for segregating site alignment outliers. Generates data for the outlier detection of genes based on segregating sites. The data will be based on the proportion of alignment columns that are variable, excluding gaps and missing data. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "ax_names": 2 element list with axis labels [x, y], "title": str with title, "outliers": list of outlier data points, "outliers_labels": list of outlier labels """ if ns: ns.files = len(self.alignments) data_points = [] data_labels = [] for aln in self.alignments.values(): if ns: if ns.stop: raise KillByUser("") ns.counter += 1 segregating_sites = 0 for column in aln.iter_columns(): # Remove gaps and missing characters column = set([x for x in column if x != aln.sequence_code[1] and x != self.gap_symbol]) if len(column) > 1: segregating_sites += 1 # Get proportion of segregating sites for current alignment data_points.append(float(segregating_sites) / float(aln.locus_length)) data_labels.append(aln.name) data_points = np.asarray(data_points) data_labels = np.asarray(data_labels) # Get outliers outliers_points = data_points[self._mad_based_outlier(data_points)] # Get outlier taxa
outlier_labels = list(data_labels[self._mad_based_outlier(data_points)]) return {"data": data_points, "title": "Sequence variation outlier gene detection", "outliers": outliers_points, "outliers_labels": outlier_labels, "ax_names": ["Proportion of segregating sites", "Frequency"]}
@check_data
def outlier_segregating_sp(self, ns=None): """Create data for segregating sites taxa outliers. Generates data for the outlier detection of species based on their average pair-wise proportion of segregating sites. Comparisons with gaps or missing data are ignored. Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "ax_names": 2 element list with axis labels [x, y], "title": str with title, "outliers": list of outlier data points, "outliers_labels": list of outlier labels """ if ns: ns.files = len(self.alignments) self._get_similarity("connect") data = OrderedDict((tx, []) for tx in self.taxa_names) for aln in self.alignments.values(): if ns: ns.counter += 1 for tx1, tx2 in itertools.combinations(data.keys(), 2): if ns: if ns.stop: raise KillByUser("") try: seq1, seq2 = aln.get_sequence(tx1), aln.get_sequence(tx2) except KeyError: continue s, t_len = self._get_similarity(seq1, seq2, aln.locus_length) if t_len: s_data = (t_len - s) / t_len else: s_data = 0 data[tx1].append(s_data) data[tx2].append(s_data) for tx, vals in data.items(): data[tx] = np.mean(vals) # Prepare data for plotting data_points = [] data_labels = [] for tx, vals in data.items(): data_points.append(vals) data_labels.append(tx) data_points = np.asarray(data_points) data_labels = np.asarray(data_labels) # Get outliers outliers_points = data_points[self._mad_based_outlier(data_points)] # Get outlier taxa outlier_labels = list(data_labels[self._mad_based_outlier(data_points)]) self._get_similarity("disconnect") return {"data": data_points, "title": "Sequence variation outlier taxa detection", "outliers": outliers_points, "outliers_labels": outlier_labels, "ax_names": ["Proportion of segregating sites", "Frequency"]}
@check_data
def outlier_sequence_size(self, ns=None): """Creates data for sequence size alignment outliers. Generates data for the outlier detection of genes based on their sequence length (excluding missing data). Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "ax_names": 2 element list with axis labels [x, y], "title": str with title, "outliers": list of outlier data points, "outliers_labels": list of outlier labels """ if ns: ns.files = len(self.alignments) data_labels = [] data_points = [] for gn, aln in self.alignments.items(): if ns: if ns.stop: raise KillByUser("") ns.counter += 1 gn_l = [] for seq in aln.iter_sequences(): gn_l.append(len(seq. replace(aln.sequence_code[1], ""). replace(self.gap_symbol, ""))) gn_avg = np.mean(gn_l) data_points.append(gn_avg) data_labels.append(gn) data_points = np.asarray(data_points) data_labels = np.asarray(data_labels) # Get outliers outliers_points = data_points[self._mad_based_outlier(data_points)] # Get outlier labels outliers_labels = list(data_labels[self._mad_based_outlier( data_points)]) return {"data": data_points, "title": "Sequence size outlier gene detection", "outliers": outliers_points, "outliers_labels": outliers_labels, "ax_names": ["Sequence size", "Frequency"]}
@check_data
def outlier_sequence_size_sp(self, ns=None): """Create data for sequence size taxa outliers. Generates data for the outlier detection of species based on their sequence length (excluding missing data). Parameters ---------- ns : multiprocessing.Manager.Namespace A Namespace object used to communicate with the main thread in TriFusion. Returns ------- _ : dict "data": `numpy.array` with data for plotting, "ax_names": 2 element list with axis labels [x, y], "title": str with title, "outliers": list of outlier data points, "outliers_labels": list of outlier labels """ if ns: ns.files = len(self.alignments) data = dict((tx, []) for tx in self.taxa_names) for aln in self.alignments.values(): if ns: if ns.stop: raise KillByUser("") ns.counter += 1 for tx in data: if tx in aln.taxa_list: seq = aln.get_sequence(tx) s_data = len(seq.replace(aln.sequence_code[1], ""). replace(self.gap_symbol, "")) data[tx].append(s_data) # Get average for each taxon for tx, vals in data.items(): data[tx] = np.mean(vals) # Prepare data for plotting data_points = [] data_labels = [] for tx, vals in data.items(): data_points.append(vals) data_labels.append(tx) data_points = np.asarray(data_points) data_labels = np.asarray(data_labels) # Get outliers outliers_points = data_points[self._mad_based_outlier(data_points)] # Get outlier taxa outlier_labels = list(data_labels[self._mad_based_outlier(data_points)]) return {"data": data_points, "title": "Sequence size outlier taxa detection", "outliers": outliers_points, "outliers_labels": outlier_labels, "ax_names": ["Sequence size", "Frequency"]}
__author__ = "Diogo N. Silva" __credits__ = ["Diogo N. Silva", "Tiago F. Jesus"]