lXtractor.core package
lXtractor.core.alignment module
A module handling multiple sequence alignments.
- class lXtractor.core.alignment.Alignment(seqs, add_method=<function mafft_add>, align_method=<function mafft_align>)[source]
Bases:
object
An MSA resource: a collection of aligned sequences.
- __init__(seqs, add_method=<function mafft_add>, align_method=<function mafft_align>)[source]
- Parameters:
seqs (Iterable[tuple[str, str]]) – An iterable with (id, _seq) pairs.
add_method (AddMethod) – A callable adding sequences. Check the type for a signature.
align_method (AlignMethod) – A callable aligning sequences.
- add(other)[source]
Add sequences to existing ones using
add()
. This is similar toalign()
but automatically adds the aligned seqs.>>> a = Alignment([('A', 'ABCD'), ('X', 'XXXX')]) >>> aa = a.add(('Y', 'ABXD')) >>> aa.shape (3, 4)
- align(seq)[source]
Align (add) sequences to this alignment via
add_method
.>>> a = Alignment([('A', 'ABCD'), ('X', 'XXXX')]) >>> aa = a.align(('Y', 'ABXD')) >>> aa.shape (1, 4) >>> aa.seqs [('Y', 'ABXD')]
- Parameters:
seq (abc.Iterable[_ST] | _ST | Alignment) – A sequence, iterable over sequences, or another
Alignment
.- Returns:
A new alignment object with sequences from _seq. The original number of columns should be preserved, which is true when using the default
add_method
.- Return type:
t.Self
- annotate(objs, map_name, accept_fn=None, **kwargs)[source]
This function “annotates” sequence segments using MSA.
Namely, it adds each sequence of the provided chain-type objects to sequences currently present in this MSA via
add_method
. The latter is expected to preserve the original number of MSA columns, whereas potentially cutting the original sequence, thereby defining MSA-imposed boundaries. These are used to extract a child object usingspawn_child
method, which will have the corresponding MSA numbering written under map_name.- Parameters:
objs (abc.Iterable[_CT]) – An iterable over chain-type objects.
map_name (str) – A name to use for storing the derived MSA numbering map.
accept_fn (abc.Callable[[_CT], bool] | None) – A function accepting a chain-type object and returning a boolean value indicating whether the spawn child sequence should be preserved.
kwargs – Additional keyword arguments passed to the
spawn_child()
method.
- Returns:
An iterator over spawned child objects. These are automatically stored under the
children
attribute of each chain-type object, in which case it’s safe to simply consume the returned iterator.
- filter_gaps(max_frac=1.0, dim=0)[source]
Filter sequences or alignment columns having >= max_frac of gaps.
>>> a = Alignment([('A', 'AB---'), ('X', 'XXXX-'), ('Y', 'YYYY-')])
By default, the max_frac gaps is 1.0, which would remove solely gap-only sequences.
>>> aa = a.filter_gaps(dim=0) >>> aa == a True
Specifying max_frac removes sequences with over 50% gaps.
>>> aa = a.filter_gaps(dim=0, max_frac=0.5) >>> 'A' not in aa True
The last column is removed.
>>> a.filter_gaps(dim=1).shape (3, 4)
- Parameters:
max_frac (float) – a maximum fraction of allowed gaps in a sequence or a column.
dim (int) –
0
for sequences,1
for columns.
- Returns:
A new
Alignment
object with filtered sequences or columns.- Return type:
t.Self
- itercols(*, join=True)[source]
Iterate over the Alignment columns.
>>> a = Alignment([('A', 'ABCD'), ('X', 'XXXX')]) >>> list(a.itercols()) ['AX', 'BX', 'CX', 'DX']
- Parameters:
join (bool) – Join columns into a string.
- Returns:
An iterator over columns.
- Return type:
Iterator[str] | Iterator[list[str]]
- classmethod make(seqs, method=<function mafft_align>, add_method=<function mafft_add>, align_method=<function mafft_align>)[source]
Create a new alignment from a collection of unaligned sequences. For aligned sequences, please utilize
read()
.- Parameters:
seqs (Iterable[tuple[str, str]]) – An iterable over (header, _seq) objects.
method (AlignMethod) – A callable accepting unaligned sequences and returning the aligned ones.
add_method (AddMethod) – A sequence addition method for a new
Alignment
object.align_method (AlignMethod) – An alignment method for a new
Alignment
object.
- Returns:
An alignment created from aligned seqs.
- Return type:
- map(fn)[source]
Map a function to sequences.
>>> a = Alignment([('A', 'AB---')]) >>> a.map(lambda x: (x[0].lower(), x[1].replace('-', '*'))).seqs [('a', 'AB***')]
- classmethod read(inp, read_method=<function read_fasta>, add_method=<function mafft_add>, align_method=<function mafft_align>)[source]
Read sequences and create an alignment.
- Parameters:
inp (Path | TextIOBase | abc.Iterable[str]) – A Path to aligned sequences, or a file handle, or iterable over file lines.
read_method (SeqReader) – A method accepting inp and returning an iterable over pairs (header, _seq). By default, it’s
read_fasta()
. Hence, the default expected format is fasta.add_method (AddMethod) – A sequence addition method for a new
Alignment
object.align_method (AlignMethod) – An alignment method for a new
Alignment
object.
- Returns:
An alignment with sequences read parsed from the provided input.
- Return type:
t.Self
- classmethod read_make(inp, read_method=<function read_fasta>, add_method=<function mafft_add>, align_method=<function mafft_align>)[source]
A shortcut combining
read()
andmake()
.- It parses sequences from inp, aligns them and creates
the
Alignment
object.
- Parameters:
inp (Path | TextIOBase | abc.Iterable[str]) – A Path to aligned sequences, or a file handle, or iterable over file lines.
read_method (SeqReader) – A method accepting inp and returning an iterable over pairs (header, _seq). By default, it’s
read_fasta()
. Hence, the default expected format is fasta.add_method (AddMethod) – A sequence addition method for a new
Alignment
object.align_method (AlignMethod) – An alignment method for a new
Alignment
object.
- Returns:
An alignment from parsed and aligned inp sequences.
- Return type:
t.Self
- realign()[source]
Realign sequences in
seqs
usingalign_method
.- Returns:
A new
Alignment
object with realigned sequences.
- remove(item, error_if_missing=True, realign=False)[source]
Remove a sequence or collection of sequences.
>>> a = Alignment([('A', 'ABCD-'), ('X', 'XXXX-'), ('Y', 'YYYYY')]) >>> aa = a.remove('A') >>> 'A' in aa False >>> aa = a.remove(('Y', 'YYYYY')) >>> aa.shape (2, 5) >>> aa = a.remove(('Y', 'YYYYY'), realign=True) >>> aa.shape (2, 4) >>> aa['A'] 'ABCD' >>> aa = a.remove(['X', 'Y']) >>> aa.shape (1, 5)
- Parameters:
item (str | _ST | t.Iterable[str] | t.Iterable[_ST]) –
One of the following:
A
str
: a sequence’s name.A pair
(str, str)
– a name with the sequence itself.An iterable over sequence enames or pairs (not mixed!)
error_if_missing (bool) – Raise an error if any of the items are missing.
realign (bool) – Realign seqs after removal.
- Returns:
A new
Alignment
object with the remaining sequences.- Return type:
t.Self
- slice(start, stop, step=None)[source]
Slice alignment columns.
>>> a = Alignment([('A', 'ABCD'), ('X', 'XXXX')]) >>> aa = a.slice(1, 2) >>> aa.shape == (2, 2) True >>> >>> aa.seqs[0] ('A', 'AB') >>> aa = a.slice(-4, 10) >>> aa.seqs[0] ('A', 'ABCD')
To add the aligned sequences to the existing ones, use
+
oradd()
:>>> aaa = a + aa >>> aaa.shape (3, 4)
- Parameters:
start (int) – Start coordinate, boundaries inclusive.
stop (int) – Stop coordinate, boundaries inclusive.
step (int | None) – Step for slicing, i.e., take every column separated by step - 1 number of columns.
- Returns:
A new alignment with sequences subset according to the slicing params.
- Return type:
t.Self
- write(out, write_method=<function write_fasta>)[source]
Write an alignment.
- Parameters:
out (Path | SupportsWrite) – Any object with the write method.
write_method (SeqWriter) – The writing function itself, accepting sequences and out. By default, use read_fasta to write in fasta format.
- Returns:
Nothing.
- Return type:
None
- align_method: AlignMethod
- seqs: list[tuple[str, str]]
- property shape: tuple[int, int]
- Returns:
(# sequences, # columns)
lXtractor.core.base module
Base classes, commong types and functions for the core module.
- class lXtractor.core.base.AbstractResource(resource_path, resource_name)[source]
Bases:
object
Abstract base class defining basic interface any resource must provide.
- class lXtractor.core.base.AddMethod(*args, **kwargs)[source]
Bases:
Protocol
A callable to add sequences to the aligned ones, preserving the alignment length.
- __init__(*args, **kwargs)
- class lXtractor.core.base.AlignMethod(*args, **kwargs)[source]
Bases:
Protocol
A callable to align arbitrary sequences.
- __init__(*args, **kwargs)
- class lXtractor.core.base.ApplyT(*args, **kwargs)[source]
Bases:
Protocol
[T
]- __init__(*args, **kwargs)
- class lXtractor.core.base.ApplyTWithArgs(*args, **kwargs)[source]
Bases:
Protocol
[T
]- __init__(*args, **kwargs)
- class lXtractor.core.base.FilterT(*args, **kwargs)[source]
Bases:
Protocol
[T
]- __init__(*args, **kwargs)
- class lXtractor.core.base.NamedTupleT(*args, **kwargs)[source]
Bases:
Protocol
,Iterable
- __init__(*args, **kwargs)
- class lXtractor.core.base.Ord(*args, **kwargs)[source]
Bases:
Protocol
[_T
]Any objects defining comparison operators.
- __init__(*args, **kwargs)
- class lXtractor.core.base.ResNameDict[source]
Bases:
UserDict
A dictionary providing mapping between PDB residue names and their one-letter codes. The mapping was parsed from the CCD and can be obtained by calling
lXtractor.ext.ccd.CCD.make_res_name_map()
.>>> d = ResNameDict() >>> assert d['ALA'] == 'A'
- class lXtractor.core.base.SeqFilter(*args, **kwargs)[source]
Bases:
Protocol
A callable accepting a pair (header, _seq) and returning a boolean.
- __init__(*args, **kwargs)
- class lXtractor.core.base.SeqMapper(*args, **kwargs)[source]
Bases:
Protocol
A callable accepting and returning a pair (header, _seq).
- __init__(*args, **kwargs)
- class lXtractor.core.base.SeqReader(*args, **kwargs)[source]
Bases:
Protocol
A callable reading sequences into tuples of (header, _seq) pairs.
- __init__(*args, **kwargs)
- class lXtractor.core.base.SeqWriter(*args, **kwargs)[source]
Bases:
Protocol
A callable writing (header, _seq) pairs to disk.
- __init__(*args, **kwargs)
lXtractor.core.config module
A module encompassing various settings of lXtractor objects.
- class lXtractor.core.config.AtomMark(value)[source]
Bases:
IntFlag
The atom categories. Some categories may be combined, e.g., LIGAND | PEP is another valid category denoting ligand peptide atoms.
- CARB: int = 32
Carbohydrate polymer atoms.
- COVALENT: int = 64
Covalent polymer modifications including ligands.
- LIGAND: int = 4
Ligand atom. If not combined with PEP, NUC, or CARB, this category denotes non-polymer (small molecule) single-residue ligands.
- NUC: int = 16
Nucleotide polymer atoms.
- PEP: int = 8
Peptide polymer atoms.
- SOLVENT: int = 2
Solvent atom.
- UNK: int = 1
Unknown atom.
- class lXtractor.core.config.Config(default_config_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/lxtractor/checkouts/latest/lXtractor/resources/default_config.json'), user_config_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/lxtractor/checkouts/latest/lXtractor/resources/user_config.json'))[source]
Bases:
UserDict
A configuration management class.
This class facilitates the loading and saving of configuration settings, with a user-specified configuration overriding the default settings.
- Parameters:
default_config_path (str | Path) – The path to the default config file. This is a reference default settings, which can be used to reset user settings if needed.
user_config_path (str | Path) – The path to the user configuration file. This file is stored internally and can be modified by a user to provide permanent settings.
Loading and mofifying the config:
>>> cfg = Config() >>> list(cfg.keys())[:2] ['bonds', 'colnames'] >>> cfg['bonds']['non_covalent_upper'] 5.0 >>> cfg['bonds']['non_covalent_upper'] = 6
Equivalently, one can update the config by a local JSON file or dict:
>>> cfg.update_with({'bonds': {'non_covalent_upper': 4}}) >>> assert cfg['bonds']['non_covalent_upper'] == 4
The changes can be stored internally and loaded automatically in the future:
>>> cfg.save() >>> cfg = Config() >>> assert cfg['bonds']['non_covalent_upper'] == 4
To restore default settings:
>>> cfg.reset_to_defaults() >>> cfg.clear_user_config()
- __init__(default_config_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/lxtractor/checkouts/latest/lXtractor/resources/default_config.json'), user_config_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/lxtractor/checkouts/latest/lXtractor/resources/user_config.json'))[source]
- save(user_config_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/lxtractor/checkouts/latest/lXtractor/resources/user_config.json'))[source]
Save the current configuration. By default, will store the configuration internally. This stored configuration will be loaded automatically on top of the default configuration.
- Parameters:
user_config_path (str | Path) – The path where to save the user configuration file.
- Raises:
ValueError – If the user config path is not provided.
- temporary_namespace()[source]
A context manager for a temporary config namespace.
Within this context, changes to the config are allowed, but will be reverted back to the original config once the context is exited.
Example:
>>> cfg = Config() >>> with cfg.temporary_namespace(): ... cfg['bonds']['non_covalent_upper'] = 10 ... # Do some stuff with the temporary config... ... # Config is reverted back to original state here >>> assert cfg['bonds']['non_covalent_upper'] != 10
lXtractor.core.exceptions module
- exception lXtractor.core.exceptions.ConfigError[source]
Bases:
ValueError
Some configuration problem.
lXtractor.core.ligand module
- class lXtractor.core.ligand.Ligand(parent, mask, contact_mask, ligand_idx, dist, meta=None)[source]
Bases:
object
Ligand object is a part of the structure falling under certain criteria.
Namely, a ligand is a non-polymer and non-solvent molecule or a single monomer. Such ligands will be designated using the format:
{res_name}_{res_id}:{chain_id}<-({parent})
If a ligand contains multiple monomers, by convention, this is a polymer ligand. Such ligands should be named using the first letter of the polymer type; one of the
("p", "n", "c")
. In this case, it’s ID will be of the following format:{polymer_type}_{min_res_id}-{max_res_id}:{chain_id}<-({parent})
This information is provided by
meta
and shouldn’t be changed. However, any additional fields can be stored inmeta
which will be retrieved when constructingsummary()
.Attributes
mask
andcontact_mask
are boolean masks allowing to obtain ligand and ligand-contacting atoms fromparent
.- ..seealso ::
make_ligand()
to initialize a new ligand in an easy way.
- is_locally_connected(mask)[source]
Check whether this ligand is connected to a subset of parent atoms.
- Parameters:
mask (ndarray) – A boolean mask to filter parent atoms.
- Returns:
True
if the ligand has at least min_atom_connections toparent
substructure imposed by the provided mask.- Return type:
bool
- property chain_id: str
- Returns:
Ligand chain ID.
- contact_mask: np.ndarray
A boolean mask such that when applied to the parent, subsets the latter to its ligand-contacting atoms.
- dist
An array of distances for each ligand-contacting parent’s atom.
- property id: str
- is_polymer
- ligand_idx
An integer array with indices pointing to ligand atoms contacting the parent structure.
- mask
A boolean mask such that when applied to the parent, subsets the latter to the ligand residues.
- meta
A dictionary of meta info.
- parent: GenericStructure
Parent structure.
- property parent_contact_atoms: AtomArray
- Returns:
An array of ligand-contacting atoms within
parent
.
- property parent_contact_chains: set[str]
- Returns:
A set of chain IDs involved in forming contacts with ligand.
- property res_id: str
- Returns:
Ligand residue number.
- property res_name: str
- Returns:
Ligand residue name.
- lXtractor.core.ligand.ligands_from_atom_marks(structure)[source]
- Return type:
abc.Generator[Ligand, None, None]
- lXtractor.core.ligand.make_ligand(m_lig, m_pol, structure)[source]
Create a new
Ligand
object. The criteria to qualify for a ligand are defined by the global config (DefaultConfig["ligand"]
).Whether a ligand molecule is created is subject to several checks:
#. It has a certain number of atoms. #. It has a certain number of contacts with the polymer. #. It contacts a certain number of residues in the polymer. #. Its atoms span a single chain.
If a ligand doesn’t pass any of these checks, the function returns
None
.- Parameters:
m_lig (npt.NDArray[np.bool_]) – A boolean mask pointing to putative ligand atoms.
m_pol (npt.NDArray[np.bool_]) – A boolean mask pointing to polymer atoms that supposedly contact ligand atoms.
structure (GenericStructure) – A parent structure to which the masks can be applied.
- Returns:
An instantiated ligand or
None
if the checks were not passed.- Return type:
Ligand | None
lXtractor.core.pocket module
The module defines Pocket
, representing an arbitrarily defined
binding pocket.
- class lXtractor.core.pocket.Pocket(definition, name='Pocket')[source]
Bases:
object
A binding pocket.
The pocket is defined via a single string following a particular syntax (a definition), such that, when applied to a ligand using
is_connected()
, the latter outputsTrue
if ligand is connected. Consequently, it is tightly bound tolXtractor.core.ligand.Ligand
. Namely, the definition relies on two matrices:“c” =
lXtractor.core.ligand.Ligand.contact_mask
(boolean mask)“d” =
lXtractor.core.ligand.Ligand.dist
(distances)
The definition is a combination of statements. Each statement involves the selection consisting of a matrix (“c” or “d”), residue positions, and residue atom names, formatted as:
{matrix-prefix}:[pos]:[atom_names] {sign} {number}
where
[pos]
and[atom_names]
can be comma-separated lists,sign
is` a comparison operator, and anumber
(int
orfloat
) is what to compare to. For instance, selectionc:1:CA,CB == 2
translates into “must have exactly two contacts with atoms “CA” and “CB” at position 1. See more examples below.Comparison meaning depends on the matrix type used: “c” or “d”.
In the former case,
>= x
means “at least x contacts”. In the latter case, “<= x” means “have distance below x”.In the case of the “d” matrix, applying selection and comparison will result in a vector of
bool
bool values, requiring an aggregation. Two aggregation types are supported: “da” (any) and “daa” (all).In the case of the “c” matrix, possible matrix prefixes are “c” and “cs”. They have very different meanings! In the former case, the statements compares the total number of contacts when the selection is applied. In the latter case, the statement will select residues separately and, for each residue, decide whether the selected atoms form enough contact to extrapolate towards the full residue and mark it as “contacting” (controlled via min_contacts). These decisions are summed across each residue and this sum is compared to the number in the statement. See the example below.
Finally, statements can be bracketed and combined by boolean operators “AND” and “OR” (which one can abbreviate by “&” and “|”).
Examples:
At least two contacts with any atom of residues 1 and 5:
c:1,5:any >= 2
Note that the above is a “cumulative” statement, i.e., it is applied to both residues at the same time. Thus, if a residue 1 has two atoms contacting a ligand while a residue 2 has none, this will still evaluate to
True
. The following definition will ensure that each residue has at least two contacts:c:1:any >= 2 & c:2:any >= 2
In contrast, the following statement will translate “among residues 1, 2, and 3, there are at least two “contacting” residues:
cs:1,2,3:any >= 2
Any atoms farther than 10A from alpha-carbons of positions 1 and 10:
da:1,10:CA > 10
Any atoms with at least two contacts with any atoms at position 1 or all CA atoms closer than 6A of positions 2 and 3:
c:1:any >= 2 | daa:2,3:CA < 6
CA or CB atoms with a contact at position 1 but not 2, while position 3 has any atoms below 10A threshold:
c:1:CA,CB >= 1 & c:2:CA,CB == 0 & da:3:any <= 10
Contact with positions 1 and 2 or positions 3 and 4:
(c:1:any >= 1 & c:2:any >= 1) | (c:3:any >= 1 & c:4:any >= 1)
See also
- is_connected(ligand, mapping=None, **kwargs)[source]
Check whether a ligand is connected.
- Parameters:
ligand (Ligand) – An arbitrary ligand.
mapping (dict[int, int] | None) – A mapping to the ligand’s parent structure numbering.
kwargs – Passed to
translate_definition()
.
- Returns:
True
if the ligand is bound within the pocket andFalse
otherwise.- Return type:
bool
- definition
- name
- lXtractor.core.pocket.make_sel(pos, atoms)[source]
Make a selection string from positions and atoms.
>>> make_sel(1, 'any') '(a.res_id == 1)' >>> make_sel([1, 2], 'CA,CB') "np.isin(a.res_id, [1, 2]) & np.isin(a.atom_name, ['CA', 'CB'])"
- Parameters:
pos (int | Sequence[int]) –
atoms (str) –
- Returns:
- Return type:
str
- lXtractor.core.pocket.translate_definition(definition, mapping=None, *, skip_unmapped=False, min_contacts=1)[source]
Translates the
Pocket.definition
into a series of statements, such that, when applied to ligand matrices, evaluate tobool
.>>> translate_definition("c:1:any > 1") '(c[np.isin(a.res_id, [1])].sum() > 1)' >>> translate_definition("da:1,2:CA,CZ <= 6") "(d[np.isin(a.res_id, [1, 2]) & np.isin(a.atom_name, ['CA', 'CZ'])] <= 6).any()" >>> translate_definition("daa:1,2:any > 2", {1: 10}, skip_unmapped=True) '(d[np.isin(a.res_id, [10])] > 2).all()' >>> translate_definition("cs:1,2:any > 2") 'sum([c[(a.res_id == 1)].sum() >= 1, c[(a.res_id == 2)].sum() >= 1]) > 2'
Warning
skip_unmapped=True
may change the pocket’s definition and lead to undesired conclusions. Caution advised!- Parameters:
definition (str) – A string definition of a
Pocket
.mapping (dict[int, int] | None) – An optional mapping from the definition’s numbering to a structure’s numbering.
skip_unmapped (bool) – If the mapping is provided and some position is left unmapped, skip this position.
min_contacts (int) – If prefix is “cs”, use this threshold to determine a minimum number of residue contacts required to consider it bound.
- Returns:
A new string with statements of the provided definition translated into a numpy syntax.
- Return type:
str
lXtractor.core.segment module
Module defines a segment object serving as base class for sequences in lXtractor.
- class lXtractor.core.segment.Segment(start, end, name='S', seqs=None, parent=None, children=None, meta=None, variables=None)[source]
Bases:
Sequence
[NamedTupleT
]An arbitrary segment with inclusive boundaries containing arbitrary number of sequences.
Sequences themselves may be retrieved via
[]
syntax:>>> s = Segment(1, 10, 'S', seqs={'X': list(range(10))}) >>> s.id == 'S|1-10' True >>> s['X'] == list(range(10)) True >>> 'X' in s True
One can use the same syntax to check if a Segment contains certain index:
>>> 1 in s and 10 in s and not 11 in s True
Iteration over the segment yields it’s items:
>>> next(iter(s)) Item(i=1, X=0)
One can just get the same item by explicit index:
>>> s[1] Item(i=1, X=0)
Slicing returns an iterable slice object:
>>> list(s[1:2]) [Item(i=1, X=0), Item(i=2, X=1)]
One can add a new sequence in two ways.
using a method:
>>> s.add_seq('Y', tuple(range(10, 20))) >>> 'Y' in s True
using
[]
syntax:
>>> s['Y'] = tuple(range(10, 20)) >>> 'Y' in s True
Note that using the first method, if
s
already containsY
, this will cause an exception. To overwrite a sequence with the same name, please use explicit[]
syntax.Additionally, one can offset Segment indices using
>>
/<<
syntax. This operation mutates original Segment!>>> s >> 1 S|2-11 >>> 11 in s True
- __init__(start, end, name='S', seqs=None, parent=None, children=None, meta=None, variables=None)[source]
- Parameters:
start (int) – Start coordinate.
end (int) – End coordinate.
name (str) – The name of the segment. Name with start and end coordinates should uniquely specify the segment. They are used to dynamically construct
id()
.seqs (dict[str, abc.Sequence[t.Any]] | None) – A dictionary name => sequence, where sequence is some sequence (preferably mutable) bounded by segment. Name of a sequence must be “simple”, i.e., convertable to a field of a namedtuple.
parent (t.Self | None) – Parental segment bounding this instance, typically obtained via
sub()
orsub_by()
methods.children (abc.MutableSequence[t.Self] | None) – A mapping name =>
Segment
with child segments bounded by this instance.meta (dict[str, t.Any] | None) – A dictionary with any meta-information str() => str() since reading/writing meta to disc will inevitably convert values to strings.
variables (Variables | None) – A collection of variables calculated or staged for calculation for this segment.
- add_seq(name, seq)[source]
Add sequence to this segment.
- Parameters:
name (str) – Sequence’s name. Should be convertible to the namedtuple’s field.
seq (Sequence[Any]) – A sequence with arbitrary elements and the length of a segment.
- Returns:
returns nothing. This operation mutates attr:`seqs.
- Raises:
ValueError – If the name is reserved by another segment.
- Return type:
None
- append(other, filler=<function Segment.<lambda>>, joiner=<built-in function add>)[source]
Append another segment to this one.
The encompassed sequences will be merged together by joiner. If a sequence is missing in this segment or other, filler will create a sequence with filled values. The sequences will be deep-copied before merge.
>>> a = Segment(1, 3, "A", seqs={"A": "AAA"}) >>> b = Segment(1, 2, "B", seqs={"B": "BB"}) >>> c = a.append(b, filler=lambda x: '*' * x) >>> c.id 'A|1-5' >>> c['A'] 'AAA**' >>> c['B'] '***BB'
Note that the same can be achieved via
|
operator:>>> a | b == a.append(b, filler=lambda x: '*' * x) True
This will use
"*"
filler forstr
-type sequences andNone
for the rest and use the default joiner for joining them.Note
Appending to an empty segment will return other. Appending an empty segment will return this segment.
Warning
Appending creates a new segment and removes associated parent and metadata
- Parameters:
other (t.Self) – Another arbitrary segment.
filler (_Filler | abc.Mapping[str, _Filler]) – A callable accepting the positive integer and returning a filled in a sequence or a
dict
mapping sequence names to such callables.joiner (_Joiner | abc.Mapping[str, _Joiner]) – A callable accepting two sequences and returning a merged sequence or a
dict
mapping sequence names to such callables.
- Returns:
A new segment with the same name as this segment, extended by other.
- Return type:
t.Self
- bounded_by(other)[source]
Check whether this segment is bounded by other.
self: +----+ other: +------+ => True
:param other; Another segment.
- Return type:
bool
- bounds(other)[source]
Check if this segment bounds other.
self: +-------+ other: +----+ => True
:param other; Another segment.
- Return type:
bool
- insert(other, i, **kwargs)[source]
Insert a segment into this one.
The function splits this segment into two parts at the provided index and insert other between them via
append()
. The latter handles common/unique sequences via filler and joiner arguments, which can be passed here as keyword arguments.Note
Inserting an empty segment returns this instance. Inserting a segment at the
end()
appends other.Warning
Inserting creates a new segment and removes associated parent and metadata
- Parameters:
other (t.Self) – Another segment to insert.
i (int) – Index to insert at. The insertion will be performed after i.
kwargs – Passed to
append()
.
- Returns:
A new segment with inserted other.
- Raises:
IndexError – If attempting to insert at an invalid index. Only indices
start < i <= end
are valid.- Return type:
t.Self
- overlap(start, end)[source]
Create new segment from the current instance using overlapping boundaries.
- Parameters:
start (int) – Starting coordinate.
end (int) – Ending coordinate.
- Returns:
New overlapping segment with
data
andname
- Return type:
t.Self
- overlap_with(other, deep_copy=True, handle_mode='merge', sep='&')[source]
Overlap this segment with other over common indices.
self: +---------+ other: +-------+ =>: +-----+
- Parameters:
deep_copy (bool) – deepcopy seqs to avoid side effects.
handle_mode (str) –
When the child overlapping segment is created, this parameter defines how
name
andmeta
are handled. The following values are possible:”merge”: merge meta and name from self and other
”self”: the current instance provides both attributes
”other”: other provides both attributes
sep (str) – If handle_mode == “merge”, the new name is created by joining names of self and other using this separator.
- Returns:
New segment instance with inherited name and meta.
- Return type:
t.Self
- overlaps(other)[source]
Check whether a segment overlaps with the other segment. Use
overlap_with()
to produce an overlapping childSegment
.
- remove_seq(name)[source]
Remove sequence from this segment.
- Parameters:
name (str) – Sequence’s name. If doesn’t exist in this segment, nothing happens.
- sub(start, end, **kwargs)[source]
Subset current segment using provided boundaries. Will create a new segment and call
sub_by()
.- Parameters:
start (int) – new start.
end (int) – new end.
kwargs – passed to
overlap_with()
- Return type:
t.Self
- sub_by(other, **kwargs)[source]
A specialized version of
overlap_with()
used in cases where other is assumed to be a part of the current segment (hence, a subsegment).- Parameters:
other (Segment) – Some other segment contained within the (start, end) boundaries.
kwargs – Passed to
overlap_with()
.
- Returns:
A new
Segment
object with boundaries of other. Seeoverlap_with()
on how to handle segments’ names and data.- Raises:
NoOverlap – If other’s boundaries lie outside the existing
start
,end
.- Return type:
t.Self
- children
- property end: int
- Returns:
A Segment’s end coordinate.
- property id: str
- Returns:
Unique segment’s identifier encapsulating name, boundaries and parents of a segment if it was spawned from another
Segment
instance. For example:S|1-2<-(P|1-10)
would specify a segment S with boundaries
[1, 2]
descended from P.
- property is_empty: bool
- Returns:
True
if the segment is empty. Emptiness is a special case, in whichSegment
has start == end == 0
.
- property is_singleton: bool
- Returns:
True
if the segment contains a single element. In this special case,start == end
.
- property item_type: _Item
A factory to make an Item namedtuple object encapsulating sequence names contained within this instance. The first field is reserved for “i” – an index. :return: Item namedtuple object.
- meta: dict[str, t.Any]
- property name: str
- property parent: t.Self | None
- property seq_names: list[str]
- Returns:
A list of sequence names this segment entails.
- property start: int
- Returns:
A Segment’s start coordinate.
- lXtractor.core.segment.do_overlap(segments)[source]
Check if any pair of segments overlap.
- Parameters:
segments (Iterable[Segment]) – an iterable with at least two segments.
- Returns:
True
if there are overlapping segments,False
otherwise.- Return type:
bool
- lXtractor.core.segment.map_segment_numbering(segments_from, segments_to)[source]
Create a continuous mapping between the numberings of two segment collections. They must contain the same number of equal length non-overlapping segments. Segments in the segments_from collection are considered to span a continuous sequence, possibly interrupted due to discontinuities in a sequence represented by segments_to’s segments. Hence, the segments in segments_from form continuous numbering over which numberings of segments_to segments are joined.
- Parameters:
- Returns:
An iterable over (key, value) pairs. Keys correspond to numberings of the segments_from, values – to numberings of segments_to.
- Return type:
Iterator[tuple[int, int | None]]
- lXtractor.core.segment.resolve_overlaps(segments, value_fn=<built-in function len>, max_it=None, verbose=False)[source]
Eliminate overlapping segments.
Convert segments into and undirected graph (see
segments2graph()
). Iterate over connected components. If a component has only a single node (no overlaps§), yield it. Otherwise, consider all possible non-overlapping subsets of nodes. Find a subset such that the sum of the value_fn over the segments is maximized and yield nodes from it.- Parameters:
segments (Iterable[Segment]) – A collection of possibly overlapping segments.
value_fn (Callable[[Segment], float]) – A function accepting the segment and returning its value.
max_it (int | None) – The maximum number of subsets to consider when resolving a group of overlapping segments.
verbose (bool) – Progress bar and general info.
- Returns:
A collection of non-overlapping segments with maximum cumulative value. Note that the optimal solution is guaranteed iff the number of possible subsets for an overlapping group does not exceed max_it.
- Return type:
Generator[Segment, None, None]
- lXtractor.core.segment.segments2graph(segments)[source]
Convert segments to an undirected graph such that segments are nodes and edges are drawn between overlapping segments.
- Parameters:
segments (Iterable[Segment]) – an iterable with segments objects.
- Returns:
an undirected graph.
- Return type:
Graph
lXtractor.core.structure module
Module defines basic interfaces to interact with macromolecular structures.
- class lXtractor.core.structure.CarbohydrateStructure(array, structure_id, ligands=True, atom_marks=None, graph=None)[source]
Bases:
GenericStructure
A structure type where primary polymer is carbohydrate.
See also
GenericStructure
for general-purpose documentation.
- class lXtractor.core.structure.GenericStructure(array, name, ligands=None, atom_marks=None, graph=None)[source]
Bases:
object
A generic macromolecular structure with possibly many chains holding a single
biotite.structure.AtomArray
instance.This object is a core data structure in lXtractor for structural data.
The object is considered immutable: atoms of a structure can’t change their location or properties, as well as other protected attributes.
While atoms are stored as
biotite.structure.AtomArray
, GenericStructure defines additional annotations for each atom and operations crucial for other objects such aslXtractor.core.chain.ChainStructure
.Upon initialization, atom array attains graph representation (
graph()
) usinglXtractor.util.structure.to_graph()
function. Using this representation, atom annotations are attained via :func``mark_atoms_g`. These annotations can be accessed viaatom_marks()
. For convenience, boolean masks are stored and can be applied to thearray()
as follows:# Assume ``s`` is a :class:`GenericStructure` object. s[s.mask.`mask_name`]
To view available mask names, see
Masks
.One of the most crucial annotations is the so-called “primary_polymer”. These atoms serve as a frame of reference for all other atoms in a structure. The rest of the atoms are categorized as either ligand or solvent. Sometimes the annotation process fails to identify certain atoms. In such cases, a warning is logged. To view uncategorized atoms, one can use the following mask:
s[s.mask.unk]
Note
Using
__getitem__(item)
like ins[s.mask.unk
will return an atom array. Usesubset()
to obtain a new generic structure or initialize a new ``GenericStructure(s[s.mask.unk] instance; it will be equivalent.Methods
__repr__
and__str__
output a string in the format:{_name}:{polymer_chain_ids};{ligand_chain_ids}|{altloc_ids}
where*ids
are “,”-separated.- __init__(array, name, ligands=None, atom_marks=None, graph=None)[source]
- Parameters:
array (AtomArray) – Atom array object.
name (str) – ID of a structure in array.
ligands (Sequence[Ligand] | None) – A list of ligands or flag indicating to extract ligands during initialization.
- extract_positions(pos, chain_ids=None, **kwargs)[source]
Extract specific positions from this structure.
- Parameters:
pos (abc.Sequence[int]) – A sequence of positions (res_id) to extract.
chain_ids (abc.Sequence[str] | str | None) – Optionally, a single chain ID or a sequence of such.
kwargs – Passed to
subset()
.
- Returns:
A new instance with extracted residues.
- Return type:
t.Self
- extract_segment(start, end, chain_id, **kwargs)[source]
Create a sub-structure encompassing some continuous segment bounded by existing position boundaries.
- Parameters:
start (int) – Residue number to start from (inclusive).
end (int) – Residue number to stop at (inclusive).
chain_id (str) – Chain to extract a segment from.
kwargs – Passed to
subset()
.
- Returns:
A new Generic structure with residues in
[start, end]
.- Return type:
t.Self
- get_sequence()[source]
- Returns:
A generator over tuples, where each residue is described by: (1) one-letter code, (2) three-letter code, (3) residue number.
- Return type:
Generator[tuple[str, str, int]]
- classmethod make_empty(structure_id='XXXX')[source]
- Parameters:
structure_id (str) – (Optional) ID of the created array.
- Returns:
An instance with empty
array()
.- Return type:
t.Self
- classmethod read(inp, path2id=<function GenericStructure.<lambda>>, structure_id='XXXX', altloc=False, **kwargs)[source]
Parse the atom array from the provided input and wrap it into the
GenericStructure
object.Note
If inp is not a
Path
,kwargs
must contain the correctfmt
(e.g.,fmt=cif
).- Parameters:
inp (IOBase | Path | str | bytes) – Path to a structure in supported format.
path2id (abc.Callable[[Path], str]) – A callable obtaining a PDB ID from the file path. By default, it’s a
Path.stem
.structure_id (str) – A structure unique identifier (e.g., PDB ID). If not provided and the input is
Path
, will usepath2id
to infer the ID. Otherwise, will use a constant placeholder.altloc (bool | str) – Parse alternative locations and populate
array.altloc_id
attribute.kwargs – Passed to
load_structure
.
- Returns:
Parsed structure.
- Return type:
t.Self
- rm_solvent(copy=False)[source]
- Parameters:
copy (bool) – Copy the resulting substructure.
- Returns:
A substructure with solvent molecules removed.
- split_altloc(**kwargs)[source]
Split into substructures based on altloc IDs. Atoms missing altloc annotations are distributed into every substructure. Thus, even if a structure contains a single atom having altlocs (say, A and B), this method will produce two substructed identical except for this atom.
Note
If
array()
does not specify any altloc ID, the method yieldsself
.- Parameters:
kwargs – Passed to
subset()
.- Returns:
An iterator over objects of the same type initialized by atoms having altloc annotations.
- Return type:
abc.Iterator[t.Self]
- split_chains(polymer=False, **kwargs)[source]
Split into separate chains. Splitting is done using
biotite.structure.get_chain_starts()
.Note
Preserved ligands may have a different
chain_id
.Note
If there is a single chain, this method will return
self
.
- subset(mask, ligands=True, reinit_ligands=False, copy=False)[source]
Create a sub-structure potentially preserving connected
ligands()
.Warning
If
DefaultConfig["structure"]["primary_pol_type"]
is set to auto, and mask points to a polymer that is shorter than some existing ligand polymer, this ligand polymer will become a primary polymer in the substructure.- Parameters:
mask (np.ndarray) – Boolean mask,
True
for atoms inarray()
, used to create a sub-structure.ligands (bool) – Keep ligands that are connected to atoms specified by mask.
reinit_ligands (bool) – Reinitialize ligands upon creating a sub-structure, rather than filtering existing ligands connected to atoms specified by mask. Takes precedence over the ligands option. This option is used in
split_altloc()
.copy (bool) – Copy the atom array resulting from subsetting the original one.
- Returns:
A new instance with atoms defined by mask and connected ligands.
- Return type:
t.Self
- superpose(other, res_id_self=None, res_id_other=None, atom_names_self=None, atom_names_other=None, mask_self=None, mask_other=None)[source]
Superpose other structure to this one. Arguments to this function all serve a single purpose: to correctly subset both structures so the resulting selections have the same number of atoms.
The subsetting achieved either by specifying residue numbers and atom names or by supplying a binary mask of the same length as the number of atoms in the structure.
- Parameters:
other (GenericStructure | AtomArray) – Other
GenericStructure
or atom array.res_id_self (Iterable[int] | None) – Residue numbers to select in this structure.
res_id_other (Iterable[int] | None) – Residue numbers to select in other structure.
atom_names_self (Iterable[Sequence[str]] | Sequence[str] | None) – Atom names to select in this structure given either per-residue or as a single sequence broadcasted to selected residues.
atom_names_other (Iterable[Sequence[str]] | Sequence[str] | None) – Same as self.
mask_self (ndarray | None) – Binary mask to select atoms. Takes precedence over other selection arguments.
mask_other (ndarray | None) – Same as self.
- Returns:
A tuple of (1) an other structure superposed onto this one, (2) an RMSD of the superposition, and (3) a transformation that had been used with
biotite.structure.superimpose_apply()
.- Return type:
tuple[GenericStructure, float, tuple[ndarray, ndarray, ndarray]]
- write(path, atom_marks=True, graph=True)[source]
Save this structure to a file. The format is automatically determined from the given path.
Additional files are saved using the same filename alongside the structure file. The filename will resolve to “structure” in all the following cases and result in “structure.npy” and “structure.json” files saved to the same dir:
path="/path/to/structure.pdb" path="/path/to/structure.mmtf.gz" path="/path/to/structure.with.many.dots.pdb.gz"
- Parameters:
path (PathLike | str) – A path or a path-like object compatible with
open()
. Must not point to an existing directory. Must provide the structure format as an extension.atom_marks (bool) – Save an array of atom marks in the npy format.
graph (bool) – Save molecular connectivity graph in the json format.
- Returns:
Path to the saved structure if writing was successful.
- Return type:
Path
- property altloc_ids: list[str]
- Returns:
A sorted list of altloc IDs. If none found, will output
[""]
.
- property array: AtomArray
- Returns:
Atom array object.
- property atom_marks: ndarray[Any, dtype[int64]]
- Returns:
An array of
lXtractor.core.config.AtomMark
marks, categorizing each atom in this structure.
- property chain_ids: list[str]
- Returns:
A list of chain IDs this structure encompasses.
- property chain_ids_ligand: list[str]
- Returns:
A set of ligand chain IDs.
- property chain_ids_polymer: list[str]
- Returns:
A list of polymer chain IDs.
- property graph: PyGraph
- Returns:
A structure’s graph representation.
- property id: str
- Returns:
An identifier of this structure. It’s composed once upon initialization and has the following format:
{_name}:{polymer_chain_ids};{ligand_chain_ids}|{altloc_ids}
. It should uniquely identify a structure, i.e., one should expect two structures with the same ID to be identical.
- property is_empty_polymer: bool
Check if there are any polymer atoms.
- Returns:
True
if there are >=1 polymer atoms andFalse
otherwise.
- property is_singleton: bool
- Returns:
True
if the structure contains a single residue.
- property name: str
- Returns:
A name of the structure.
- class lXtractor.core.structure.Masks(primary_polymer: 'npt.NDArray[np.bool_]', primary_polymer_ptm: 'npt.NDArray[np.bool_]', primary_polymer_modified: 'npt.NDArray[np.bool_]', solvent: 'npt.NDArray[np.bool_]', ligand: 'npt.NDArray[np.bool_]', ligand_covalent: 'npt.NDArray[np.bool_]', ligand_poly: 'npt.NDArray[np.bool_]', ligand_nonpoly: 'npt.NDArray[np.bool_]', ligand_pep: 'npt.NDArray[np.bool_]', ligand_nuc: 'npt.NDArray[np.bool_]', ligand_carb: 'npt.NDArray[np.bool_]', unk: 'npt.NDArray[np.bool_]')[source]
Bases:
object
- __init__(primary_polymer, primary_polymer_ptm, primary_polymer_modified, solvent, ligand, ligand_covalent, ligand_poly, ligand_nonpoly, ligand_pep, ligand_nuc, ligand_carb, unk)
- ligand: ndarray[Any, dtype[bool_]]
- ligand_carb: ndarray[Any, dtype[bool_]]
- ligand_covalent: ndarray[Any, dtype[bool_]]
- ligand_nonpoly: ndarray[Any, dtype[bool_]]
- ligand_nuc: ndarray[Any, dtype[bool_]]
- ligand_pep: ndarray[Any, dtype[bool_]]
- ligand_poly: ndarray[Any, dtype[bool_]]
- primary_polymer: ndarray[Any, dtype[bool_]]
- primary_polymer_modified: ndarray[Any, dtype[bool_]]
- primary_polymer_ptm: ndarray[Any, dtype[bool_]]
- solvent: ndarray[Any, dtype[bool_]]
- unk: ndarray[Any, dtype[bool_]]
- class lXtractor.core.structure.NucleotideStructure(array, structure_id, ligands=True, atom_marks=None, graph=None)[source]
Bases:
GenericStructure
A structure type where primary polymer is nucleotide.
See also
GenericStructure
for general-purpose documentation.
- class lXtractor.core.structure.ProteinStructure(array, structure_id, ligands=True, atom_marks=None, graph=None)[source]
Bases:
GenericStructure
A structure type where primary polymer is peptide.
See also
GenericStructure
for general-purpose documentation.
- lXtractor.core.structure.mark_atoms(structure)[source]
Mark each atom in structure according to
lXtractor.core.config.AtomMark
.This function is used upon initializing
GenericStructure
and its subclasses, storing the output underGenericStructure.atom_marks
.- Parameters:
structure (GenericStructure) – An arbitrary structure.
- Returns:
An array of atom marks (equivalently, classes or types).
- Return type:
tuple[ndarray[Any, dtype[int64]], list[Ligand]]
- lXtractor.core.structure.mark_atoms_g(s, single_poly_chain=False)[source]
Mark structure atoms based on a molecular graph’s representation by of the
lXtractor.core.config.AtomMark
categories.Atoms are classified into five categories:
#. primary polymer: corresponds to ``PEP``, ``NUC`` or ``CARB`` categories. #. solvent: ``SOLVENT``. #. non polymer ligand: ``LIGAND``. #. polymer ligand: A combination of ``LIGAND`` with one of the primary polymer types, eg. ``AtomMark.LIGAND | AtomMark.NUC``. #. unknown: ``UNK`` for atoms that couldn't be categorized.
The classification process depends on groups of atoms forming covalent bonds with each other, or connected components in the molecular graph representation. Each such component is assessed separately and its atoms are classified as polymer, ligand, or solvent. If the primary polymer is set to “auto” in config (
DefaultConfig["structure"]["primary_pol_type"]
), the polymer with the largest number of monomers will be selected. The rest of the polymers will become polymer ligands: special kind of ligand that can have multiple residues. SeelXtractore.core.ligand.Ligand
for details.- Parameters:
s (GenericStructure) –
single_poly_chain (bool) –
- Returns:
- Return type:
(npt.NDArray[np.int_], str, list[Ligand])