trifusion.ortho.OrthomclToolbox module

class trifusion.ortho.OrthomclToolbox.Cluster(line_string)[source]

Bases: object

Object representing a cluster of the OrthoMCL groups file. It sets a number of attributes that make subsequent filtering and processing much easier.

Methods

apply_filter(gene_threshold, species_threshold) Updates two Cluster attributes, self.gene_flag and self.species_flag
parse_string(cluster_string) Parses the string and sets the group name and sequence list attributes
remove_taxa(taxa_list) Removes the taxa contained in taxa_list from self.sequences and self.species_frequency
apply_filter(gene_threshold, species_threshold)[source]

This method will update two Cluster attributes, self.gene_flag and self.species_flag, which will inform downstream objects if this cluster respects the gene and species threshold.

Parameters:
  • gene_threshold – Integer for the maximum number of gene copies per species
  • species_threshold – Integer for the minimum number of species present
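As a rough illustration of how the two flags could be derived, the sketch below assumes a cluster is summarized by a dictionary mapping each species to its gene copy count. The function name `threshold_flags` and the dictionary layout are hypothetical, and reading a True flag as "respects the threshold" is an assumption, not the documented behavior of Cluster.apply_filter.

```python
def threshold_flags(species_frequency, gene_threshold, species_threshold):
    """Return (gene_flag, species_flag) for one cluster.

    species_frequency: dict mapping species name -> number of gene copies
    (hypothetical layout, assumed for this sketch).
    """
    # Gene flag: no species may exceed the maximum number of gene copies.
    gene_flag = max(species_frequency.values(), default=0) <= gene_threshold
    # Species flag: the cluster must contain at least the minimum number of species.
    species_flag = len(species_frequency) >= species_threshold
    return gene_flag, species_flag


example = {"Hsapiens": 1, "Mmusculus": 2, "Drerio": 1}
flags = threshold_flags(example, gene_threshold=1, species_threshold=3)
```

Here one species carries two copies, so the gene flag fails while the species flag passes.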

parse_string(cluster_string)[source]

Parses the string and sets the group name and sequence list attributes
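A minimal sketch of this kind of parsing, assuming the usual OrthoMCL groups layout of a group name, a colon, and whitespace-separated `taxon|sequence_id` entries. The function `parse_cluster_line` is a hypothetical stand-in and does not reproduce the actual Cluster.parse_string implementation.

```python
def parse_cluster_line(cluster_string):
    """Split one groups-file line into (name, sequences)."""
    # Everything before the first colon is taken as the group name.
    name, _, members = cluster_string.partition(":")
    # The remaining whitespace-separated fields are the sequence identifiers.
    sequences = members.split()
    return name.strip(), sequences


name, sequences = parse_cluster_line(
    "MyGroups1000: Hsapiens|P1 Mmusculus|Q7 Mmusculus|Q8\n"
)
```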

remove_taxa(taxa_list)[source]

Removes the taxa contained in taxa_list from self.sequences and self.species_frequency.

Parameters:
  • taxa_list – list, each element should be a taxon name
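The effect of removing taxa can be sketched as follows, assuming sequence identifiers of the form `taxon|sequence_id`. The helper `drop_taxa` is hypothetical and returns new objects instead of modifying attributes in place.

```python
from collections import Counter


def drop_taxa(sequences, taxa_list):
    """Return the remaining sequences and their per-species counts."""
    excluded = set(taxa_list)
    # Keep only sequences whose taxon prefix is not in the exclusion set.
    kept = [s for s in sequences if s.split("|", 1)[0] not in excluded]
    # Recompute the species frequency from the remaining sequences.
    species_frequency = Counter(s.split("|", 1)[0] for s in kept)
    return kept, dict(species_frequency)


kept, frequency = drop_taxa(["A|1", "B|2", "B|3", "C|4"], ["B"])
```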

class trifusion.ortho.OrthomclToolbox.Group(groups_file, gene_threshold=None, species_threshold=None, project_prefix='MyGroups')[source]

Bases: object

This represents the main object of the OrthoMCL toolbox module. It is initialized with the file name of an OrthoMCL groups file and provides several methods that act on that groups file. To process multiple Group objects, see the MultiGroups object.

Methods

bar_genecopy_distribution([dest, filt, ...]) Creates a bar plot with the distribution of gene copies across clusters
bar_species_coverage([dest, filt, ns, ...]) Creates a stacked bar plot with the proportion of
bar_species_distribution([dest, filt, ns, ...]) Creates a bar plot with the distribution of species numbers across clusters
basic_group_statistics() Creates a basic table in list format containing basic information of the groups file
exclude_taxa(taxa_list) Adds a taxon_name to the excluded_taxa list and updates the filtered_groups list
export_filtered_group([output_file_name, ...]) Export the filtered groups into a new file.
get_filters() Returns a tuple with the thresholds for max gene copies and min species
paralog_per_species_statistic([...]) Creates a CSV table with information on the number of paralog clusters per species
retrieve_sequences(database[, dest, mode, ...]) When provided with a database in Fasta format, this will use the Alignment object to retrieve sequences
update_filtered_group() Creates a new filtered group variable, like export_filtered_group, but replaces the self.filtered_groups variable instead of writing a new file
update_filters(gn_filter, sp_filter) Sets new values for the self.species_threshold and self.gene_threshold and updates the filtered_group
bar_genecopy_distribution(dest='./', filt=False, output_file_name='Gene_copy_distribution.png')[source]

Creates a bar plot with the distribution of gene copies across clusters.

Parameters:
  • dest – string, destination directory
  • filt – Boolean, whether or not to use the filtered groups
  • output_file_name – string, name of the output file

bar_species_coverage(dest='./', filt=False, ns=None, output_file_name='Species_coverage')[source]

Creates a stacked bar plot with the proportion of :return:

bar_species_distribution(dest='./', filt=False, ns=None, output_file_name='Species_distribution')[source]

Creates a bar plot with the distribution of species numbers across clusters.

Parameters:
  • dest – string, destination directory
  • filt – Boolean, whether or not to use the filtered groups
  • output_file_name – string, name of the output file

basic_group_statistics()[source]

This method creates a basic table in list format containing basic information of the groups file (total number of clusters, total number of sequences, number of clusters below the gene threshold, number of clusters below the species threshold and number of clusters below the gene AND species threshold).

Returns: List containing the number of [total clusters, total sequences, clusters above gene threshold, clusters above species threshold, clusters above gene and species threshold]
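The five-element list can be sketched as below. This is an illustration only: clusters are assumed to be dictionaries of species to gene copy counts, the function `basic_statistics` is hypothetical, and the last three counters are assumed to count clusters that comply with the gene threshold (maximum copies), the species threshold (minimum species), and both.

```python
def basic_statistics(clusters, gene_threshold, species_threshold):
    """Return [total clusters, total sequences, gene-compliant,
    species-compliant, gene-and-species-compliant] counts."""
    total_clusters = len(clusters)
    total_sequences = sum(sum(freq.values()) for freq in clusters)
    gene_ok = species_ok = both_ok = 0
    for freq in clusters:
        passes_gene = max(freq.values(), default=0) <= gene_threshold
        passes_species = len(freq) >= species_threshold
        # Booleans add as 0 or 1, so each counter grows only on a pass.
        gene_ok += passes_gene
        species_ok += passes_species
        both_ok += passes_gene and passes_species
    return [total_clusters, total_sequences, gene_ok, species_ok, both_ok]


clusters = [{"A": 1, "B": 1, "C": 1}, {"A": 2, "B": 1}, {"A": 1}]
stats = basic_statistics(clusters, gene_threshold=1, species_threshold=2)
```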
exclude_taxa(taxa_list)[source]

Adds a taxon_name to the excluded_taxa list and updates the filtered_groups list

export_filtered_group(output_file_name='filtered_groups', dest='./', get_stats=False, shared_namespace=None)[source]

Export the filtered groups into a new file.

Parameters:
  • output_file_name – string, name of the filtered groups file
  • dest – string, path to directory where the filtered groups file will be created
  • get_stats – Boolean, whether to return the basic count stats or not
  • shared_namespace – Namespace object, for communicating with main process

get_filters()[source]

Returns a tuple with the thresholds for max gene copies and min species

paralog_per_species_statistic(output_file_name='Paralog_per_species.csv', filt=True)[source]

This method creates a CSV table with information on the number of paralog clusters per species.

Parameters:
  • output_file_name – string. Name of the output csv file
  • filt – Boolean. Whether to use the filtered groups (True) or total groups (False)
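A small sketch of such a table, under stated assumptions: a cluster counts as a paralog cluster for a species when that species has more than one gene copy in it, and the two-column layout with the headers shown is hypothetical. The real output format of paralog_per_species_statistic may differ.

```python
import csv
import io
from collections import Counter


def paralog_clusters_per_species(clusters):
    """Count, per species, the clusters in which it has more than one copy."""
    counts = Counter()
    for freq in clusters:
        for species, copies in freq.items():
            if copies > 1:
                counts[species] += 1
    return counts


def write_csv(counts, handle):
    """Write the counts as a two-column CSV, one species per row."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Species", "Paralog clusters"])
    for species in sorted(counts):
        writer.writerow([species, counts[species]])


clusters = [{"A": 2, "B": 1}, {"A": 3, "B": 2}, {"B": 1}]
buffer = io.StringIO()
write_csv(paralog_clusters_per_species(clusters), buffer)
csv_text = buffer.getvalue()
```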

retrieve_sequences(database, dest='./', mode='fasta', filt=True, shared_namespace=None)[source]

When provided with a database in Fasta format, this will use the Alignment object to retrieve sequences.

Parameters:
  • database – string. Fasta file
  • dest – string. Path to the directory where the retrieved sequences will be saved
  • mode – string, whether to retrieve sequences to a file (‘fasta’), or a dictionary (‘dict’)
  • filt – Boolean. Whether to use the filtered groups (True) or total groups (False)
  • shared_namespace – Namespace object. This argument is meant for when Fasta sequences are retrieved in a background process, where there is a need to update the main process of the changes in this method

update_filtered_group()[source]

This method creates a new filtered group variable, like export_filtered_group, but instead of writing into a new file, it replaces the self.filtered_groups variable

update_filters(gn_filter, sp_filter)[source]

Sets new values for the self.species_threshold and self.gene_threshold and updates the filtered_group.

Parameters:
  • gn_filter – int. Maximum value for gene copies in cluster
  • sp_filter – int. Minimum value for species in cluster

class trifusion.ortho.OrthomclToolbox.GroupLight(groups_file, gene_threshold=None, species_threshold=None, ns=None)[source]

Bases: object

Analogous to Group object but with several changes to reduce memory usage

Methods

bar_genecopy_distribution([filt]) Creates a bar plot with the distribution of gene copies across clusters
bar_genecopy_per_species([filt])
bar_species_coverage([filt]) Creates a stacked bar plot with the proportion of
bar_species_distribution([filt])
basic_group_statistics([update_stats])
exclude_taxa(taxa_list[, update_stats]) Updates the excluded_taxa attribute and updates group statistics if update_stats is True.
export_filtered_group([output_file_name, ...])
groups() Generator for group file.
iter_species_frequency() In order to prevent permanent changes to the species_frequency attribute due to the filtering of taxa, this iterable should be used instead of the said variable.
retrieve_sequences(sqldb, protein_db[, ...])
param sqldb: string. Path to sqlite database file
update_filters(gn_filter, sp_filter[, ...]) Updates the group filter attributes and group summary stats if update_stats is True.
bar_genecopy_distribution(filt=False)[source]

Creates a bar plot with the distribution of gene copies across clusters.

Parameters:
  • filt – Boolean, whether or not to use the filtered groups

bar_genecopy_per_species(filt=False)[source]
bar_species_coverage(filt=False)[source]

Creates a stacked bar plot with the proportion of :return:

bar_species_distribution(filt=False)[source]
basic_group_statistics(update_stats=True)[source]
exclude_taxa(taxa_list, update_stats=False)[source]

Updates the excluded_taxa attribute and updates group statistics if update_stats is True. This does not change the Group object data permanently, only sets an attribute that will be taken into account when plotting and exporting data.

Parameters:
  • taxa_list – list. List of taxa that should be excluded from downstream operations
  • update_stats – boolean. If True, it will update the group statistics

export_filtered_group(output_file_name='filtered_groups', dest='./', shared_namespace=None)[source]
groups()[source]

Generator for group file. This replaces the self.groups attribute of the original Group object. Instead of loading the whole file into memory, a generator is created to iterate over its contents. It may run a bit slower but it is a lot more memory efficient.
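The generator approach can be sketched as follows. The function `iter_groups` and the `name: taxon|id ...` line layout are assumptions made for illustration; the point is only that lines are read and yielded one at a time instead of being stored in a list.

```python
import os
import tempfile


def iter_groups(groups_file):
    """Yield (name, sequences) one cluster at a time."""
    with open(groups_file) as handle:
        for line in handle:
            # Skip blank lines without loading anything else into memory.
            if not line.strip():
                continue
            name, _, members = line.partition(":")
            yield name.strip(), members.split()


# Write a tiny groups file to demonstrate lazy iteration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("g1: A|1 B|2\n\ng2: A|3\n")
    path = tmp.name

parsed = list(iter_groups(path))
os.remove(path)
```

Each pass over the data reopens the file, which is the source of the speed cost mentioned above.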

iter_species_frequency()[source]

In order to prevent permanent changes to the species_frequency attribute due to the filtering of taxa, this iterable should be used instead of the said variable. This creates a temporary deepcopy of species_frequency which will be iterated over and eventually modified.
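The idea of iterating over a temporary deep copy can be sketched as below. The layout of species_frequency as a list of per-cluster dictionaries and the function name `iter_filtered_frequency` are assumptions for this example.

```python
import copy


def iter_filtered_frequency(species_frequency, excluded_taxa):
    """Yield per-cluster frequencies with excluded taxa removed."""
    # Work on a deep copy so the caller's data is never modified.
    temp = copy.deepcopy(species_frequency)
    for cluster in temp:
        for taxon in excluded_taxa:
            cluster.pop(taxon, None)
        yield cluster


original = [{"A": 1, "B": 2}, {"B": 1, "C": 1}]
filtered = list(iter_filtered_frequency(original, ["B"]))
```

After the loop, `original` still contains taxon "B", while the yielded dictionaries do not.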

retrieve_sequences(sqldb, protein_db, dest='./', shared_namespace=None, outfile=None)[source]
Parameters:
  • sqldb – string. Path to sqlite database file
  • protein_db – string. Path to protein database file
  • dest – string. Directory where sequences will be exported
  • shared_namespace – Namespace object to communicate with TriFusion’s main process
  • outfile – If set, all sequences will instead be saved in a single output file. This is used for the nucleotide sequence export

update_filters(gn_filter, sp_filter, update_stats=False)[source]

Updates the group filter attributes and group summary stats if update_stats is True. This method does not change the data of the Group object, only sets attributes that will be taken into account when plotting or exporting data.

Parameters:
  • gn_filter – integer. Maximum number of gene copies allowed in an ortholog cluster
  • sp_filter – integer/float. Minimum number/proportion of taxa representation
  • update_stats – boolean. If True it will update the group summary statistics
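Since sp_filter accepts either an integer or a float, one plausible way to normalize it to a minimum species count is sketched below. The helper `min_species_count`, the `total_taxa` argument, and the choice to round proportions up are all assumptions, not the documented behavior of GroupLight.update_filters.

```python
import math


def min_species_count(sp_filter, total_taxa):
    """Convert an int count or a float proportion into a minimum count."""
    if isinstance(sp_filter, float):
        # Treat floats as a proportion of all taxa; round() guards against
        # floating point noise before rounding up to a whole species.
        return math.ceil(round(sp_filter * total_taxa, 9))
    # Integers are already an absolute number of species.
    return sp_filter


as_count = min_species_count(4, 10)
as_proportion = min_species_count(0.25, 10)
```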

class trifusion.ortho.OrthomclToolbox.MultiGroups(groups_files=None, gene_threshold=None, species_threshold=None, project_prefix='MyGroups')[source]

Bases: object

Creates an object composed of multiple Group objects

Methods

add_group(group_obj) Adds a group object
add_multigroups(multigroup_obj) Merges a MultiGroup object
bar_orthologs([output_file_name, dest, stats]) Creates a bar plot with the final ortholog values for each group file
basic_multigroup_statistics([output_file_name])
get_gnames()
get_group(group_id) Returns a group object based on its name.
group_overlap() This will find the overlap of orthologs between two group files.
iter_gnames()
remove_group(group_id) Removes a group object according to its name
update_filters(gn_filter, sp_filter[, ...]) This will not change the Group objects themselves, only the filter mapping.
add_group(group_obj)[source]

Adds a group object.

Parameters:
  • group_obj – Group object

add_multigroups(multigroup_obj)[source]

Merges a MultiGroup object.

Parameters:
  • multigroup_obj – MultiGroup object

bar_orthologs(output_file_name='Final_orthologs', dest='./', stats='total')[source]

Creates a bar plot with the final ortholog values for each group file.

Parameters:
  • output_file_name – string. Name of output file
  • dest – string. Output directory
  • stats – string. The statistics that should be used to generate the bar plot. Options are:
      “1”: Total orthologs
      “2”: Species compliant orthologs
      “3”: Gene compliant orthologs
      “4”: Final orthologs
      “all”: All of the above
    Multiple combinations can be provided, for instance: “123” will display bars for total, species compliant and gene compliant stats
basic_multigroup_statistics(output_file_name='multigroup_base_statistics.csv')[source]
Parameters:output_file_name
Returns:
get_gnames()[source]
get_group(group_id)[source]

Returns a group object based on its name. If the name does not match any group object, returns None.

Parameters:
  • group_id – string. Name of group object

group_overlap()[source]

This will find the overlap of orthologs between two group files. THIS METHOD IS TEMPORARY AND EXPERIMENTAL
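One simple notion of overlap is sketched below, assuming two clusters match when they contain exactly the same set of sequence identifiers. Both this definition and the function `cluster_overlap` are hypothetical; the experimental method may compute overlap differently.

```python
def cluster_overlap(groups_a, groups_b):
    """Return the clusters (as frozensets of sequence ids) found in both inputs."""
    # frozenset makes each cluster hashable and independent of member order.
    set_a = {frozenset(seqs) for seqs in groups_a}
    set_b = {frozenset(seqs) for seqs in groups_b}
    return set_a & set_b


groups_a = [["A|1", "B|2"], ["A|3", "C|4"]]
groups_b = [["B|2", "A|1"], ["A|3", "C|9"]]
shared = cluster_overlap(groups_a, groups_b)
```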

iter_gnames()[source]
remove_group(group_id)[source]

Removes a group object according to its name.

Parameters:
  • group_id – string, name matching a Group object name attribute

update_filters(gn_filter, sp_filter, group_names=None, default=False)[source]

This will not change the Group objects themselves, only the filter mapping. The filter is only applied when the Group object is retrieved, to reduce computations.

Parameters:
  • gn_filter – int, filter for max gene copies
  • sp_filter – int, filter for min species
  • group_names – list, with names of group objects
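The pattern of storing filters separately and applying them only on retrieval can be sketched as follows. The class `LazyFilterRegistry` and its cluster representation (dictionaries of species to gene copy counts) are hypothetical stand-ins, not the MultiGroups implementation.

```python
class LazyFilterRegistry:
    """Hypothetical stand-in: stores clusters and per-group filters separately."""

    def __init__(self):
        self.groups = {}   # name -> list of {species: gene copies} dicts
        self.filters = {}  # name -> (gn_filter, sp_filter)

    def add_group(self, name, clusters):
        self.groups[name] = clusters
        self.filters[name] = (None, None)

    def update_filters(self, gn_filter, sp_filter, group_names=None):
        # Only the mapping changes here; the stored clusters are untouched.
        targets = self.groups if group_names is None else group_names
        for name in targets:
            if name in self.groups:
                self.filters[name] = (gn_filter, sp_filter)

    def get_group(self, name):
        if name not in self.groups:
            return None
        gn_filter, sp_filter = self.filters[name]
        clusters = self.groups[name]
        if gn_filter is None or sp_filter is None:
            return clusters
        # The filter is applied only now, at retrieval time.
        return [
            c for c in clusters
            if max(c.values(), default=0) <= gn_filter and len(c) >= sp_filter
        ]


registry = LazyFilterRegistry()
registry.add_group("groupA", [{"A": 1, "B": 1}, {"A": 3}])
registry.update_filters(gn_filter=1, sp_filter=2)
filtered = registry.get_group("groupA")
```

Updating filters is cheap because no cluster is inspected until a group is requested.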

class trifusion.ortho.OrthomclToolbox.MultiGroupsLight(db_path, groups=None, gene_threshold=None, species_threshold=None, project_prefix='MyGroups', ns=None)[source]

Bases: object

Creates an object composed of multiple Group objects, like MultiGroups. However, instead of storing the groups in memory, these are shelved on disk.
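The general idea of keeping group data on disk instead of in memory can be illustrated with Python's standard shelve module. This is only an analogy: the storage backend behind db_path is not described here, and the keys and values below are made up for the example.

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "groups_db")

    # Store a per-group summary on disk under the group name.
    with shelve.open(db_path) as db:
        db["groupA"] = {"clusters": 120, "sequences": 950}

    # Reopen later and load only the entry that is needed.
    with shelve.open(db_path) as db:
        restored = db["groupA"]
        stored_names = sorted(db.keys())
```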

Methods

add_group(group_obj) Adds a group object
add_multigroups(multigroup_obj) Merges a MultiGroup object
bar_orthologs([group_names, ...]) Creates a bar plot with the final ortholog values for each group file
clear_groups() Clears the current MultiGroupsLight object
get_group(group_id) Returns a group object based on its name.
get_multigroup_statistics(group_obj)
remove_group(group_id) Removes a group object according to its name
update_filters(gn_filter, sp_filter, ...[, ...]) This will not change the Group objects themselves, only the filter mapping.
add_group(group_obj)[source]

Adds a group object.

Parameters:
  • group_obj – Group object

add_multigroups(multigroup_obj)[source]

Merges a MultiGroup object.

Parameters:
  • multigroup_obj – MultiGroup object

bar_orthologs(group_names=None, output_file_name='Final_orthologs', dest='./', stats='all')[source]

Creates a bar plot with the final ortholog values for each group file.

Parameters:
  • group_names – list. If None, all groups in self.group_stats will be used to generate the plot. Else, only the groups with the names in the list will be plotted
  • output_file_name – string. Name of output file
  • dest – string. Output directory
  • stats – string. The statistics that should be used to generate the bar plot. Options are:
      “1”: Total orthologs
      “2”: Species compliant orthologs
      “3”: Gene compliant orthologs
      “4”: Final orthologs
      “all”: All of the above
    Multiple combinations can be provided, for instance: “123” will display bars for total, species compliant and gene compliant stats
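How a stats string such as “123” could be expanded into the selected statistics is sketched below. The mapping `STAT_LABELS` and the function `selected_stats` are hypothetical; only the option codes and their meanings come from the description above.

```python
STAT_LABELS = {
    "1": "Total orthologs",
    "2": "Species compliant orthologs",
    "3": "Gene compliant orthologs",
    "4": "Final orthologs",
}


def selected_stats(stats):
    """Expand a stats option string into the list of statistic labels."""
    if stats == "all":
        return list(STAT_LABELS.values())
    # Each character is one option code; unknown characters are ignored.
    return [STAT_LABELS[code] for code in stats if code in STAT_LABELS]


chosen = selected_stats("123")
```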
calls = ['bar_genecopy_distribution', 'bar_species_distribution', 'bar_species_coverage', 'bar_genecopy_per_species']
clear_groups()[source]

Clears the current MultiGroupsLight object

get_group(group_id)[source]

Returns a group object based on its name. If the name does not match any group object, returns None.

Parameters:
  • group_id – string. Name of group object

get_multigroup_statistics(group_obj)[source]
Returns:
remove_group(group_id)[source]

Removes a group object according to its name.

Parameters:
  • group_id – string, name matching a Group object name attribute

update_filters(gn_filter, sp_filter, excluded_taxa, group_names=None, default=False)[source]

This will not change the Group objects themselves, only the filter mapping. The filter is only applied when the Group object is retrieved, to reduce computations.

Parameters:
  • gn_filter – int, filter for max gene copies
  • sp_filter – int, filter for min species
  • group_names – list, with names of group objects
exception trifusion.ortho.OrthomclToolbox.OrthoGroupException[source]

Bases: exceptions.Exception