lXtractor.core package
lXtractor.core.alignment module
A module handling multiple sequence alignments.
- class lXtractor.core.alignment.Alignment(seqs, add_method=<function mafft_add>, align_method=<function mafft_align>)[source]
Bases:
object
An MSA resource: a collection of aligned sequences.
- __init__(seqs, add_method=<function mafft_add>, align_method=<function mafft_align>)[source]
- Parameters:
seqs (Iterable[tuple[str, str]]) – An iterable with (id, _seq) pairs.
add_method (AddMethod) – A callable adding sequences. Check the type for a signature.
align_method (AlignMethod) – A callable aligning sequences.
- add(other)[source]
Add sequences to existing ones using add(). This is similar to align() but automatically adds the aligned seqs.
>>> a = Alignment([('A', 'ABCD'), ('X', 'XXXX')])
>>> aa = a.add(('Y', 'ABXD'))
>>> aa.shape
(3, 4)
- align(seq)[source]
Align (add) sequences to this alignment via add_method.
>>> a = Alignment([('A', 'ABCD'), ('X', 'XXXX')])
>>> aa = a.align(('Y', 'ABXD'))
>>> aa.shape
(1, 4)
>>> aa.seqs
[('Y', 'ABXD')]
- Parameters:
seq (abc.Iterable[_ST] | _ST | Alignment) – A sequence, iterable over sequences, or another Alignment.
- Returns:
A new alignment object with sequences from seq. The original number of columns should be preserved, which is true when using the default add_method.
- Return type:
t.Self
- annotate(objs, map_name, accept_fn=None, **kwargs)[source]
This function “annotates” sequence segments using MSA.
Namely, it adds each sequence of the provided chain-type objects to sequences currently present in this MSA via add_method. The latter is expected to preserve the original number of MSA columns while potentially cutting the original sequence, thereby defining MSA-imposed boundaries. These are used to extract a child object using the spawn_child method, which will have the corresponding MSA numbering written under map_name.
- Parameters:
objs (abc.Iterable[_CT]) – An iterable over chain-type objects.
map_name (str) – A name to use for storing the derived MSA numbering map.
accept_fn (abc.Callable[[_CT], bool] | None) – A function accepting a chain-type object and returning a boolean value indicating whether the spawn child sequence should be preserved.
kwargs – Additional keyword arguments passed to the spawn_child() method.
- Returns:
An iterator over spawned child objects. These are automatically stored under the children attribute of each chain-type object, in which case it’s safe to simply consume the returned iterator.
- filter_gaps(max_frac=1.0, dim=0)[source]
Filter sequences or alignment columns having >= max_frac of gaps.
>>> a = Alignment([('A', 'AB---'), ('X', 'XXXX-'), ('Y', 'YYYY-')])
By default, max_frac is 1.0, which removes only gap-only sequences.
>>> aa = a.filter_gaps(dim=0)
>>> aa == a
True
Specifying max_frac=0.5 removes sequences with over 50% gaps.
>>> aa = a.filter_gaps(dim=0, max_frac=0.5)
>>> 'A' not in aa
True
With dim=1, the last (gap-only) column is removed.
>>> a.filter_gaps(dim=1).shape
(3, 4)
- Parameters:
max_frac (float) – a maximum fraction of allowed gaps in a sequence or a column.
dim (int) – 0 for sequences, 1 for columns.
- Returns:
A new Alignment object with filtered sequences or columns.
- Return type:
t.Self
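The gap-filtering rule can be illustrated in plain Python without lXtractor. This is only a sketch, not the library's implementation: the helper names gap_fraction and filter_gaps_sketch are hypothetical, "-" is assumed to be the gap character, and the ">= max_frac" reading from the summary line is assumed.

```python
def gap_fraction(seq, gap="-"):
    """Fraction of gap characters in a sequence or a column."""
    return seq.count(gap) / len(seq) if seq else 0.0


def filter_gaps_sketch(seqs, max_frac=1.0, dim=0):
    """Drop rows (dim=0) or columns (dim=1) whose gap fraction is >= max_frac."""
    if dim == 0:
        return [(name, s) for name, s in seqs if gap_fraction(s) < max_frac]
    names = [name for name, _ in seqs]
    columns = ["".join(col) for col in zip(*(s for _, s in seqs))]
    kept = [col for col in columns if gap_fraction(col) < max_frac]
    # Transpose the kept columns back into rows.
    rows = ["".join(chars) for chars in zip(*kept)] if kept else [""] * len(names)
    return list(zip(names, rows))


seqs = [("A", "AB---"), ("X", "XXXX-"), ("Y", "YYYY-")]
rows_kept = filter_gaps_sketch(seqs, max_frac=0.5, dim=0)
cols_kept = filter_gaps_sketch(seqs, dim=1)
```

Here rows_kept retains only X and Y, and cols_kept has four columns, mirroring the doctest outputs above.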
- itercols(*, join=True)[source]
Iterate over the Alignment columns.
>>> a = Alignment([('A', 'ABCD'), ('X', 'XXXX')])
>>> list(a.itercols())
['AX', 'BX', 'CX', 'DX']
- Parameters:
join (bool) – Join columns into a string.
- Returns:
An iterator over columns.
- Return type:
Iterator[str] | Iterator[list[str]]
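Column iteration amounts to transposing the aligned sequences. A minimal sketch, assuming sequences are given as (name, sequence) pairs of equal length; itercols_sketch is a hypothetical stand-in, not the library method.

```python
def itercols_sketch(seqs, join=True):
    """Yield alignment columns, as strings if join is True, else as lists."""
    for column in zip(*(seq for _, seq in seqs)):
        yield "".join(column) if join else list(column)


pairs = [("A", "ABCD"), ("X", "XXXX")]
cols = list(itercols_sketch(pairs))
cols_unjoined = list(itercols_sketch(pairs, join=False))
```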
- classmethod make(seqs, method=<function mafft_align>, add_method=<function mafft_add>, align_method=<function mafft_align>)[source]
Create a new alignment from a collection of unaligned sequences. For aligned sequences, please use read().
- Parameters:
seqs (Iterable[tuple[str, str]]) – An iterable over (header, _seq) objects.
method (AlignMethod) – A callable accepting unaligned sequences and returning the aligned ones.
add_method (AddMethod) – A sequence addition method for a new Alignment object.
align_method (AlignMethod) – An alignment method for a new Alignment object.
- Returns:
An alignment created from aligned seqs.
- Return type:
- map(fn)[source]
Map a function to sequences.
>>> a = Alignment([('A', 'AB---')])
>>> a.map(lambda x: (x[0].lower(), x[1].replace('-', '*'))).seqs
[('a', 'AB***')]
- classmethod read(inp, read_method=<function read_fasta>, add_method=<function mafft_add>, align_method=<function mafft_align>)[source]
Read sequences and create an alignment.
- Parameters:
inp (Path | TextIOBase | abc.Iterable[str]) – A Path to aligned sequences, or a file handle, or iterable over file lines.
read_method (SeqReader) – A method accepting inp and returning an iterable over pairs (header, _seq). By default, it’s read_fasta(). Hence, the default expected format is fasta.
add_method (AddMethod) – A sequence addition method for a new Alignment object.
align_method (AlignMethod) – An alignment method for a new Alignment object.
- Returns:
An alignment with sequences parsed from the provided input.
- Return type:
t.Self
- classmethod read_make(inp, read_method=<function read_fasta>, add_method=<function mafft_add>, align_method=<function mafft_align>)[source]
A shortcut combining read() and make(). It parses sequences from inp, aligns them, and creates the Alignment object.
- Parameters:
inp (Path | TextIOBase | abc.Iterable[str]) – A Path to aligned sequences, or a file handle, or iterable over file lines.
read_method (SeqReader) – A method accepting inp and returning an iterable over pairs (header, _seq). By default, it’s read_fasta(). Hence, the default expected format is fasta.
add_method (AddMethod) – A sequence addition method for a new Alignment object.
align_method (AlignMethod) – An alignment method for a new Alignment object.
- Returns:
An alignment from parsed and aligned inp sequences.
- Return type:
t.Self
- realign()[source]
Realign sequences in seqs using align_method.
- Returns:
A new Alignment object with realigned sequences.
- remove(item, error_if_missing=True, realign=False)[source]
Remove a sequence or collection of sequences.
>>> a = Alignment([('A', 'ABCD-'), ('X', 'XXXX-'), ('Y', 'YYYYY')])
>>> aa = a.remove('A')
>>> 'A' in aa
False
>>> aa = a.remove(('Y', 'YYYYY'))
>>> aa.shape
(2, 5)
>>> aa = a.remove(('Y', 'YYYYY'), realign=True)
>>> aa.shape
(2, 4)
>>> aa['A']
'ABCD'
>>> aa = a.remove(['X', 'Y'])
>>> aa.shape
(1, 5)
- Parameters:
item (str | _ST | t.Iterable[str] | t.Iterable[_ST]) –
One of the following:
A str: a sequence’s name.
A pair (str, str): a name with the sequence itself.
An iterable over sequence names or pairs (not mixed!).
error_if_missing (bool) – Raise an error if any of the items are missing.
realign (bool) – Realign seqs after removal.
- Returns:
A new Alignment object with the remaining sequences.
- Return type:
t.Self
- slice(start, stop, step=None)[source]
Slice alignment columns.
>>> a = Alignment([('A', 'ABCD'), ('X', 'XXXX')])
>>> aa = a.slice(1, 2)
>>> aa.shape == (2, 2)
True
>>> aa.seqs[0]
('A', 'AB')
>>> aa = a.slice(-4, 10)
>>> aa.seqs[0]
('A', 'ABCD')
To add the aligned sequences to the existing ones, use + or add():
>>> aaa = a + aa
>>> aaa.shape
(3, 4)
- Parameters:
start (int) – Start coordinate, boundaries inclusive.
stop (int) – Stop coordinate, boundaries inclusive.
step (int | None) – Step for slicing, i.e., take every step-th column.
- Returns:
A new alignment with sequences subset according to the slicing params.
- Return type:
t.Self
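The doctest outputs above suggest 1-based, inclusive boundaries that are clipped to the alignment width. The following sketch reproduces that behavior in plain Python; slice_columns_sketch is a hypothetical helper, and the clipping rule is an assumption inferred from the examples rather than taken from the library source.

```python
def slice_columns_sketch(seqs, start, stop, step=None):
    """Slice columns using 1-based inclusive boundaries clipped to the alignment."""
    n_cols = len(seqs[0][1]) if seqs else 0
    lo = max(start, 1)      # clip the left boundary to the first column
    hi = min(stop, n_cols)  # clip the right boundary to the last column
    return [(name, seq[lo - 1:hi:step]) for name, seq in seqs]


pairs = [("A", "ABCD"), ("X", "XXXX")]
first_two = slice_columns_sketch(pairs, 1, 2)
clipped = slice_columns_sketch(pairs, -4, 10)
```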
- write(out, write_method=<function write_fasta>)[source]
Write an alignment.
- Parameters:
out (Path | SupportsWrite) – A path or any object with the write method.
write_method (SeqWriter) – The writing function itself, accepting sequences and out. By default, write_fasta is used to write in fasta format.
- Returns:
Nothing.
- Return type:
None
- align_method: AlignMethod
- seqs: list[tuple[str, str]]
- property shape: tuple[int, int]
- Returns:
(# sequences, # columns)
lXtractor.core.base module
Base classes, common types and functions for the core module.
- class lXtractor.core.base.AbstractResource(resource_path, resource_name)[source]
Bases:
object
Abstract base class defining the basic interface any resource must provide.
- class lXtractor.core.base.AddMethod(*args, **kwargs)[source]
Bases:
Protocol
A callable to add sequences to the aligned ones, preserving the alignment length.
- __init__(*args, **kwargs)
- class lXtractor.core.base.AlignMethod(*args, **kwargs)[source]
Bases:
Protocol
A callable to align arbitrary sequences.
- __init__(*args, **kwargs)
- class lXtractor.core.base.ApplyT(*args, **kwargs)[source]
Bases:
Protocol[T]
- __init__(*args, **kwargs)
- class lXtractor.core.base.ApplyTWithArgs(*args, **kwargs)[source]
Bases:
Protocol[T]
- __init__(*args, **kwargs)
- class lXtractor.core.base.FilterT(*args, **kwargs)[source]
Bases:
Protocol[T]
- __init__(*args, **kwargs)
- class lXtractor.core.base.NamedTupleT(*args, **kwargs)[source]
Bases:
Protocol, Iterable
- __init__(*args, **kwargs)
- class lXtractor.core.base.Ord(*args, **kwargs)[source]
Bases:
Protocol[_T]
Any objects defining comparison operators.
- __init__(*args, **kwargs)
- class lXtractor.core.base.ResNameDict[source]
Bases:
UserDict
A dictionary providing mapping between PDB residue names and their one-letter codes. The mapping was parsed from the CCD and can be obtained by calling lXtractor.ext.ccd.CCD.make_res_name_map().
>>> d = ResNameDict()
>>> assert d['ALA'] == 'A'
- class lXtractor.core.base.SeqFilter(*args, **kwargs)[source]
Bases:
Protocol
A callable accepting a pair (header, _seq) and returning a boolean.
- __init__(*args, **kwargs)
- class lXtractor.core.base.SeqMapper(*args, **kwargs)[source]
Bases:
Protocol
A callable accepting and returning a pair (header, _seq).
- __init__(*args, **kwargs)
- class lXtractor.core.base.SeqReader(*args, **kwargs)[source]
Bases:
Protocol
A callable reading sequences into tuples of (header, _seq) pairs.
- __init__(*args, **kwargs)
- class lXtractor.core.base.SeqWriter(*args, **kwargs)[source]
Bases:
Protocol
A callable writing (header, _seq) pairs to disk.
- __init__(*args, **kwargs)
- class lXtractor.core.base.SupportsLT(*args, **kwargs)[source]
Bases:
Protocol[_T]
- __init__(*args, **kwargs)
lXtractor.core.config module
A module encompassing various settings of lXtractor objects.
- class lXtractor.core.config.AtomMark(value)[source]
Bases:
IntFlag
The atom categories. Some categories may be combined, e.g., LIGAND | PEP is another valid category denoting ligand peptide atoms.
- CARB: int = 32
Carbohydrate polymer atoms.
- COVALENT: int = 64
Covalent polymer modifications including ligands.
- LIGAND: int = 4
Ligand atom. If not combined with PEP, NUC, or CARB, this category denotes non-polymer (small molecule) single-residue ligands.
- NUC: int = 16
Nucleotide polymer atoms.
- PEP: int = 8
Peptide polymer atoms.
- SOLVENT: int = 2
Solvent atom.
- UNK: int = 1
Unknown atom.
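Because the categories are bit flags, combining and testing them uses bitwise operators. The sketch below rebuilds the flag values listed above with the standard library enum.IntFlag; AtomMarkSketch is a hypothetical stand-in used so the example runs without lXtractor installed.

```python
from enum import IntFlag


class AtomMarkSketch(IntFlag):
    """Stand-in flag set with the same values as listed above."""
    UNK = 1
    SOLVENT = 2
    LIGAND = 4
    PEP = 8
    NUC = 16
    CARB = 32
    COVALENT = 64


# A ligand peptide atom carries both the LIGAND and the PEP bits.
ligand_peptide = AtomMarkSketch.LIGAND | AtomMarkSketch.PEP
is_ligand = bool(ligand_peptide & AtomMarkSketch.LIGAND)
is_solvent = bool(ligand_peptide & AtomMarkSketch.SOLVENT)
```

With an integer array of marks, the same bitwise test can be applied element-wise to build a boolean mask.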
- class lXtractor.core.config.Config(default_config_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/lxtractor/checkouts/latest/lXtractor/resources/default_config.json'), user_config_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/lxtractor/checkouts/latest/lXtractor/resources/user_config.json'))[source]
Bases:
UserDict
A configuration management class.
This class facilitates the loading and saving of configuration settings, with a user-specified configuration overriding the default settings.
- Parameters:
default_config_path (str | Path) – The path to the default config file. It holds the reference default settings, which can be used to reset user settings if needed.
user_config_path (str | Path) – The path to the user configuration file. This file is stored internally and can be modified by a user to provide permanent settings.
Loading and modifying the config:
>>> cfg = Config()
>>> list(cfg.keys())[:2]
['bonds', 'colnames']
>>> cfg['bonds']['non_covalent_upper']
5.0
>>> cfg['bonds']['non_covalent_upper'] = 6
Equivalently, one can update the config by a local JSON file or dict:
>>> cfg.update_with({'bonds': {'non_covalent_upper': 4}})
>>> assert cfg['bonds']['non_covalent_upper'] == 4
The changes can be stored internally and loaded automatically in the future:
>>> cfg.save()
>>> cfg = Config()
>>> assert cfg['bonds']['non_covalent_upper'] == 4
To restore default settings:
>>> cfg.reset_to_defaults()
>>> cfg.clear_user_config()
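The idea of user settings overriding defaults in a nested config can be sketched as a recursive dictionary merge. This is an illustration only: deep_update is a hypothetical helper, the "other_setting" key is made up for the example, and the library may merge its JSON files differently.

```python
from copy import deepcopy


def deep_update(base, overrides):
    """Recursively merge overrides into a copy of base; base is left untouched."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# "other_setting" is a made-up key showing that untouched defaults survive.
defaults = {"bonds": {"non_covalent_upper": 5.0, "other_setting": 1}}
user = {"bonds": {"non_covalent_upper": 4}}
effective = deep_update(defaults, user)
```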
- __init__(default_config_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/lxtractor/checkouts/latest/lXtractor/resources/default_config.json'), user_config_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/lxtractor/checkouts/latest/lXtractor/resources/user_config.json'))[source]
- save(user_config_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/lxtractor/checkouts/latest/lXtractor/resources/user_config.json'))[source]
Save the current configuration. By default, will store the configuration internally. This stored configuration will be loaded automatically on top of the default configuration.
- Parameters:
user_config_path (str | Path) – The path where to save the user configuration file.
- Raises:
ValueError – If the user config path is not provided.
- temporary_namespace()[source]
A context manager for a temporary config namespace.
Within this context, changes to the config are allowed, but will be reverted back to the original config once the context is exited.
Example:
>>> cfg = Config()
>>> with cfg.temporary_namespace():
...     cfg['bonds']['non_covalent_upper'] = 10
...     # Do some stuff with the temporary config...
... # Config is reverted back to original state here
>>> assert cfg['bonds']['non_covalent_upper'] != 10
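One way such a reverting context manager could work is to snapshot the data on entry and restore it on exit. The sketch below uses a hypothetical TinyConfig class built on the standard library; it shows the pattern and makes no claim about how Config implements it.

```python
from collections import UserDict
from contextlib import contextmanager
from copy import deepcopy


class TinyConfig(UserDict):
    """Hypothetical minimal config used only to demonstrate the pattern."""

    @contextmanager
    def temporary_namespace(self):
        snapshot = deepcopy(self.data)  # remember the state on entry
        try:
            yield self
        finally:
            self.data = snapshot  # restore it on exit, even after an error


cfg = TinyConfig({"bonds": {"non_covalent_upper": 5.0}})
with cfg.temporary_namespace():
    cfg["bonds"]["non_covalent_upper"] = 10
    inside = cfg["bonds"]["non_covalent_upper"]
outside = cfg["bonds"]["non_covalent_upper"]
```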
lXtractor.core.exceptions module
- exception lXtractor.core.exceptions.ConfigError[source]
Bases:
ValueError
Some configuration problem.
lXtractor.core.segment module
Module defines a segment object serving as base class for sequences in lXtractor.
- class lXtractor.core.segment.Segment(start, end, name='S', seqs=None, parent=None, children=None, meta=None, variables=None)[source]
Bases:
Sequence[NamedTupleT]
An arbitrary segment with inclusive boundaries containing an arbitrary number of sequences.
Sequences themselves may be retrieved via [] syntax:
>>> s = Segment(1, 10, 'S', seqs={'X': list(range(10))})
>>> s.id == 'S|1-10'
True
>>> s['X'] == list(range(10))
True
>>> 'X' in s
True
One can use the same syntax to check if a Segment contains a certain index:
>>> 1 in s and 10 in s and not 11 in s
True
Iteration over the segment yields its items:
>>> next(iter(s))
Item(i=1, X=0)
One can get the same item by explicit index:
>>> s[1]
Item(i=1, X=0)
Slicing returns an iterable slice object:
>>> list(s[1:2])
[Item(i=1, X=0), Item(i=2, X=1)]
One can add a new sequence in two ways.
Using a method:
>>> s.add_seq('Y', tuple(range(10, 20)))
>>> 'Y' in s
True
Using [] syntax:
>>> s['Y'] = tuple(range(10, 20))
>>> 'Y' in s
True
Note that using the first method, if s already contains Y, this will cause an exception. To overwrite a sequence with the same name, please use explicit [] syntax.
Additionally, one can offset Segment indices using >>/<< syntax. This operation mutates the original Segment!
>>> s >> 1
S|2-11
>>> 11 in s
True
- __init__(start, end, name='S', seqs=None, parent=None, children=None, meta=None, variables=None)[source]
- Parameters:
start (int) – Start coordinate.
end (int) – End coordinate.
name (str) – The name of the segment. Name with start and end coordinates should uniquely specify the segment. They are used to dynamically construct id().
seqs (dict[str, abc.Sequence[t.Any]] | None) – A dictionary name => sequence, where sequence is some sequence (preferably mutable) bounded by segment. Name of a sequence must be “simple”, i.e., convertible to a field of a namedtuple.
parent (t.Self | None) – Parental segment bounding this instance, typically obtained via sub() or sub_by() methods.
children (abc.MutableSequence[t.Self] | None) – A mapping name => Segment with child segments bounded by this instance.
meta (dict[str, t.Any] | None) – A dictionary with any meta-information str() => str() since reading/writing meta to disk will inevitably convert values to strings.
variables (Variables | None) – A collection of variables calculated or staged for calculation for this segment.
- add_seq(name, seq)[source]
Add sequence to this segment.
- Parameters:
name (str) – Sequence’s name. Should be convertible to the namedtuple’s field.
seq (Sequence[Any]) – A sequence with arbitrary elements and the length of a segment.
- Returns:
Nothing. This operation mutates seqs.
- Raises:
ValueError – If the name is reserved by another segment.
- Return type:
None
- append(other, filler=<function Segment.<lambda>>, joiner=<built-in function add>)[source]
Append another segment to this one.
The encompassed sequences will be merged together by joiner. If a sequence is missing in this segment or other, filler will create a sequence with filled values. The sequences will be deep-copied before merge.
>>> a = Segment(1, 3, "A", seqs={"A": "AAA"})
>>> b = Segment(1, 2, "B", seqs={"B": "BB"})
>>> c = a.append(b, filler=lambda x: '*' * x)
>>> c.id
'A|1-5'
>>> c['A']
'AAA**'
>>> c['B']
'***BB'
Note that the same can be achieved via the | operator:
>>> a | b == a.append(b, filler=lambda x: '*' * x)
True
This will use the "*" filler for str-type sequences and None for the rest, and the default joiner for joining them.
Note
Appending to an empty segment will return other. Appending an empty segment will return this segment.
Warning
Appending creates a new segment and removes associated parent and metadata.
- Parameters:
other (t.Self) – Another arbitrary segment.
filler (_Filler | abc.Mapping[str, _Filler]) – A callable accepting a positive integer and returning a filled-in sequence, or a dict mapping sequence names to such callables.
joiner (_Joiner | abc.Mapping[str, _Joiner]) – A callable accepting two sequences and returning a merged sequence, or a dict mapping sequence names to such callables.
- Returns:
A new segment with the same name as this segment, extended by other.
- Return type:
t.Self
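The interplay of filler and joiner can be sketched on plain dictionaries of sequences. The function append_seqs_sketch below is hypothetical and ignores names, boundaries, and deep-copying; it only shows how a missing sequence is padded on one side before joining.

```python
from operator import add


def append_seqs_sketch(seqs_a, len_a, seqs_b, len_b, filler, joiner=add):
    """Merge two name => sequence mappings, padding sequences missing on one side."""
    merged = {}
    names = list(seqs_a) + [n for n in seqs_b if n not in seqs_a]
    for name in names:
        left = seqs_a[name] if name in seqs_a else filler(len_a)
        right = seqs_b[name] if name in seqs_b else filler(len_b)
        merged[name] = joiner(left, right)
    return merged


merged = append_seqs_sketch(
    {"A": "AAA"}, 3, {"B": "BB"}, 2, filler=lambda n: "*" * n
)
```

The result mirrors the doctest above: "A" is padded on the right and "B" on the left.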
- bounded_by(other)[source]
Check whether this segment is bounded by other.
self:    +----+
other:  +------+
=> True
- Parameters:
other – Another segment.
- Return type:
bool
- bounds(other)[source]
Check if this segment bounds other.
self:   +-------+
other:   +----+
=> True
- Parameters:
other – Another segment.
- Return type:
bool
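Both checks reduce to comparing inclusive boundaries. A minimal sketch on (start, end) values, assuming that equal boundaries count as bounded; the function names mirror the methods but are standalone, hypothetical helpers.

```python
def bounds(start, end, other_start, other_end):
    """True if [start, end] fully contains [other_start, other_end]."""
    return start <= other_start and other_end <= end


def bounded_by(start, end, other_start, other_end):
    """True if [start, end] lies fully within [other_start, other_end]."""
    return bounds(other_start, other_end, start, end)


outer_bounds_inner = bounds(1, 10, 3, 7)
inner_bounded_by_outer = bounded_by(3, 7, 1, 10)
partial = bounds(1, 10, 5, 12)
```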
- insert(other, i, **kwargs)[source]
Insert a segment into this one.
The function splits this segment into two parts at the provided index and inserts other between them via append(). The latter handles common/unique sequences via filler and joiner arguments, which can be passed here as keyword arguments.
Note
Inserting an empty segment returns this instance. Inserting a segment at the end() appends other.
Warning
Inserting creates a new segment and removes associated parent and metadata.
- Parameters:
other (t.Self) – Another segment to insert.
i (int) – Index to insert at. The insertion will be performed after i.
kwargs – Passed to append().
- Returns:
A new segment with inserted other.
- Raises:
IndexError – If attempting to insert at an invalid index. Only indices start < i <= end are valid.
- Return type:
t.Self
- overlap(start, end)[source]
Create new segment from the current instance using overlapping boundaries.
- Parameters:
start (int) – Starting coordinate.
end (int) – Ending coordinate.
- Returns:
New overlapping segment with data and name.
- Return type:
t.Self
- overlap_with(other, deep_copy=True, handle_mode='merge', sep='&')[source]
Overlap this segment with other over common indices.
self:   +---------+
other:      +-------+
=>:         +-----+
- Parameters:
deep_copy (bool) – Deep-copy seqs to avoid side effects.
handle_mode (str) –
When the child overlapping segment is created, this parameter defines how name and meta are handled. The following values are possible:
"merge": merge meta and name from self and other
"self": the current instance provides both attributes
"other": other provides both attributes
sep (str) – If handle_mode == "merge", the new name is created by joining names of self and other using this separator.
- Returns:
New segment instance with inherited name and meta.
- Return type:
t.Self
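The boundaries of the overlapping child follow from the larger start and the smaller end. A minimal sketch on inclusive (start, end) values; overlap_bounds is a hypothetical helper that returns None where the library would signal a missing overlap in its own way.

```python
def overlap_bounds(a_start, a_end, b_start, b_end):
    """Return inclusive boundaries shared by both intervals, or None if disjoint."""
    start, end = max(a_start, b_start), min(a_end, b_end)
    return (start, end) if start <= end else None


shared = overlap_bounds(1, 10, 5, 15)
disjoint = overlap_bounds(1, 3, 4, 6)
touching = overlap_bounds(1, 3, 3, 6)
```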
- overlaps(other)[source]
Check whether a segment overlaps with the other segment. Use overlap_with() to produce an overlapping child Segment.
- remove_seq(name)[source]
Remove sequence from this segment.
- Parameters:
name (str) – Sequence’s name. If it doesn’t exist in this segment, nothing happens.
- sub(start, end, **kwargs)[source]
Subset current segment using provided boundaries. Will create a new segment and call sub_by().
- Parameters:
start (int) – New start.
end (int) – New end.
kwargs – Passed to overlap_with().
- Return type:
t.Self
- sub_by(other, **kwargs)[source]
A specialized version of overlap_with() used in cases where other is assumed to be a part of the current segment (hence, a subsegment).
- Parameters:
other (Segment) – Some other segment contained within the (start, end) boundaries.
kwargs – Passed to overlap_with().
- Returns:
A new Segment object with boundaries of other. See overlap_with() on how to handle segments’ names and data.
- Raises:
NoOverlap – If other’s boundaries lie outside the existing start, end.
- Return type:
t.Self
- children
- property end: int
- Returns:
A Segment’s end coordinate.
- property id: str
- Returns:
Unique segment’s identifier encapsulating name, boundaries and parents of a segment if it was spawned from another Segment instance. For example:
S|1-2<-(P|1-10)
would specify a segment S with boundaries [1, 2] descended from P.
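The identifier format shown above can be reproduced with simple string formatting. This sketch is hypothetical: segment_id_sketch is not a library function, and how deeper ancestry chains are rendered is an assumption not confirmed by the text.

```python
def segment_id_sketch(name, start, end, parent_id=None):
    """Build 'name|start-end', appending '<-(parent_id)' when a parent exists."""
    own = f"{name}|{start}-{end}"
    return own if parent_id is None else f"{own}<-({parent_id})"


parent_id = segment_id_sketch("P", 1, 10)
child_id = segment_id_sketch("S", 1, 2, parent_id)
```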
- property is_empty: bool
- Returns:
True if the segment is empty. Emptiness is a special case, in which Segment has start == end == 0.
- property is_singleton: bool
- Returns:
True if the segment contains a single element. In this special case, start == end.
- property item_type: _Item
A factory to make an Item namedtuple object encapsulating sequence names contained within this instance. The first field is reserved for “i” – an index.
- Returns:
Item namedtuple object.
- meta: dict[str, t.Any]
- property name: str
- property parent: t.Self | None
- property seq_names: list[str]
- Returns:
A list of sequence names this segment entails.
- property start: int
- Returns:
A Segment’s start coordinate.
- lXtractor.core.segment.do_overlap(segments)[source]
Check if any pair of segments overlap.
- Parameters:
segments (Iterable[Segment]) – an iterable with at least two segments.
- Returns:
True if there are overlapping segments, False otherwise.
- Return type:
bool
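For inclusive intervals, one common way to detect any overlapping pair is to sort by start and compare each start against the largest end seen so far. The sketch below works on (start, end) tuples; do_overlap_sketch is a hypothetical helper and the library function may check pairs differently.

```python
def do_overlap_sketch(intervals):
    """True if any two inclusive (start, end) intervals share a position."""
    max_end = None
    for start, end in sorted(intervals):
        if max_end is not None and start <= max_end:
            return True
        max_end = end if max_end is None else max(max_end, end)
    return False


separate = do_overlap_sketch([(1, 5), (6, 10)])
touching = do_overlap_sketch([(1, 5), (5, 10)])
nested = do_overlap_sketch([(20, 30), (1, 10), (2, 3)])
```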
- lXtractor.core.segment.map_segment_numbering(segments_from, segments_to)[source]
Create a continuous mapping between the numberings of two segment collections. They must contain the same number of equal length non-overlapping segments. Segments in the segments_from collection are considered to span a continuous sequence, possibly interrupted due to discontinuities in a sequence represented by segments_to’s segments. Hence, the segments in segments_from form continuous numbering over which numberings of segments_to segments are joined.
- Parameters:
- Returns:
An iterable over (key, value) pairs. Keys correspond to numberings of the segments_from, values – to numberings of segments_to.
- Return type:
Iterator[tuple[int, int | None]]
- lXtractor.core.segment.resolve_overlaps(segments, value_fn=<built-in function len>, max_it=None, verbose=False)[source]
Eliminate overlapping segments.
Convert segments into an undirected graph (see segments2graph()). Iterate over connected components. If a component has only a single node (no overlaps), yield it. Otherwise, consider all possible non-overlapping subsets of nodes. Find a subset such that the sum of the value_fn over the segments is maximized and yield nodes from it.
- Parameters:
segments (Iterable[Segment]) – A collection of possibly overlapping segments.
value_fn (Callable[[Segment], float]) – A function accepting the segment and returning its value.
max_it (int | None) – The maximum number of subsets to consider when resolving a group of overlapping segments.
verbose (bool) – Progress bar and general info.
- Returns:
A collection of non-overlapping segments with maximum cumulative value. Note that the optimal solution is guaranteed iff the number of possible subsets for an overlapping group does not exceed max_it.
- Return type:
Generator[Segment, None, None]
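The procedure described above can be sketched on plain (start, end) tuples without a graph library. This is an illustration under stated assumptions, not the library code: all helper names are hypothetical, the search is exhaustive with no max_it cutoff, and the default value is the inclusive segment length.

```python
from itertools import combinations


def overlaps(a, b):
    """Inclusive intervals overlap if neither ends before the other starts."""
    return a[0] <= b[1] and b[0] <= a[1]


def components(segments):
    """Group segments into connected components of the overlap graph."""
    unvisited = set(range(len(segments)))
    while unvisited:
        stack, group = [unvisited.pop()], []
        while stack:
            i = stack.pop()
            group.append(i)
            linked = {j for j in unvisited if overlaps(segments[i], segments[j])}
            unvisited -= linked
            stack.extend(linked)
        yield [segments[i] for i in sorted(group)]


def best_subset(group, value_fn):
    """Exhaustively pick the non-overlapping subset with the largest total value."""
    best, best_value = [], float("-inf")
    for size in range(1, len(group) + 1):
        for subset in combinations(group, size):
            if any(overlaps(a, b) for a, b in combinations(subset, 2)):
                continue
            value = sum(value_fn(s) for s in subset)
            if value > best_value:
                best, best_value = list(subset), value
    return best


def resolve_overlaps_sketch(segments, value_fn=lambda s: s[1] - s[0] + 1):
    for group in components(segments):
        yield from (group if len(group) == 1 else best_subset(group, value_fn))


segments = [(1, 10), (5, 8), (9, 20), (30, 40)]
resolved = sorted(resolve_overlaps_sketch(segments))
```

In this example (1, 10) overlaps both (5, 8) and (9, 20), while the latter two are compatible and together cover more positions, so (1, 10) is dropped.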
- lXtractor.core.segment.segments2graph(segments)[source]
Convert segments to an undirected graph such that segments are nodes and edges are drawn between overlapping segments.
- Parameters:
segments (Iterable[Segment]) – an iterable with segments objects.
- Returns:
an undirected graph.
- Return type:
Graph
lXtractor.core.structure module
Module defines basic interfaces to interact with macromolecular structures.
- class lXtractor.core.structure.CarbohydrateStructure(array, structure_id, ligands=True, atom_marks=None, graph=None)[source]
Bases:
GenericStructure
A structure type where primary polymer is carbohydrate.
See also
GenericStructure for general-purpose documentation.
- class lXtractor.core.structure.GenericStructure(array, name, ligands=None, atom_marks=None, graph=None)[source]
Bases:
object
A generic macromolecular structure with possibly many chains holding a single biotite.structure.AtomArray instance.
This object is a core data structure in lXtractor for structural data.
The object is considered immutable: atoms of a structure can’t change their location or properties, as well as other protected attributes.
While atoms are stored as biotite.structure.AtomArray, GenericStructure defines additional annotations for each atom and operations crucial for other objects such as lXtractor.core.chain.ChainStructure.
Upon initialization, the atom array attains a graph representation (graph()) using the lXtractor.util.structure.to_graph() function. Using this representation, atom annotations are attained via mark_atoms_g. These annotations can be accessed via atom_marks(). For convenience, boolean masks are stored and can be applied to the array() as follows:
# Assume s is a GenericStructure object.
s[s.mask.<mask_name>]
To view available mask names, see Masks.
One of the most crucial annotations is the so-called “primary_polymer”. These atoms serve as a frame of reference for all other atoms in a structure. The rest of the atoms are categorized as either ligand or solvent. Sometimes the annotation process fails to identify certain atoms. In such cases, a warning is logged. To view uncategorized atoms, one can use the following mask:
s[s.mask.unk]
Note
Using __getitem__(item) like in s[s.mask.unk] will return an atom array. Use subset() to obtain a new generic structure or initialize a new GenericStructure(s[s.mask.unk]) instance; it will be equivalent.
Methods __repr__ and __str__ output a string in the format: {_name}:{polymer_chain_ids};{ligand_chain_ids}|{altloc_ids} where *ids are “,”-separated.
- __init__(array, name, ligands=None, atom_marks=None, graph=None)[source]
- Parameters:
array (AtomArray) – Atom array object.
name (str) – ID of a structure in array.
ligands (Sequence[Ligand] | None) – A list of ligands or flag indicating to extract ligands during initialization.
- extract_positions(pos, chain_ids=None, **kwargs)[source]
Extract specific positions from this structure.
- Parameters:
pos (abc.Sequence[int]) – A sequence of positions (res_id) to extract.
chain_ids (abc.Sequence[str] | str | None) – Optionally, a single chain ID or a sequence of such.
kwargs – Passed to
subset().
- Returns:
A new instance with extracted residues.
- Return type:
t.Self
- extract_segment(start, end, chain_id, **kwargs)[source]
Create a sub-structure encompassing some continuous segment bounded by existing position boundaries.
- Parameters:
start (int) – Residue number to start from (inclusive).
end (int) – Residue number to stop at (inclusive).
chain_id (str) – Chain to extract a segment from.
kwargs – Passed to subset().
- Returns:
A new Generic structure with residues in [start, end].
- Return type:
t.Self
- get_sequence()[source]
- Returns:
A generator over tuples, where each residue is described by: (1) one-letter code, (2) three-letter code, (3) residue number.
- Return type:
Generator[tuple[str, str, int]]
- classmethod make_empty(structure_id='XXXX')[source]
- Parameters:
structure_id (str) – (Optional) ID of the created array.
- Returns:
An instance with empty array().
- Return type:
t.Self
- classmethod read(inp, path2id=<function GenericStructure.<lambda>>, structure_id='XXXX', altloc=False, **kwargs)[source]
Parse the atom array from the provided input and wrap it into the GenericStructure object.
Note
If inp is not a Path, kwargs must contain the correct fmt (e.g., fmt=cif).
- Parameters:
inp (IOBase | Path | str | bytes) – Path to a structure in supported format.
path2id (abc.Callable[[Path], str]) – A callable obtaining a PDB ID from the file path. By default, it’s a Path.stem.
structure_id (str) – A structure unique identifier (e.g., PDB ID). If not provided and the input is Path, will use path2id to infer the ID. Otherwise, will use a constant placeholder.
altloc (bool | str) – Parse alternative locations and populate array.altloc_id attribute.
kwargs – Passed to load_structure.
- Returns:
Parsed structure.
- Return type:
t.Self
- rm_solvent(copy=False)[source]
- Parameters:
copy (bool) – Copy the resulting substructure.
- Returns:
A substructure with solvent molecules removed.
- split_altloc(**kwargs)[source]
Split into substructures based on altloc IDs. Atoms missing altloc annotations are distributed into every substructure. Thus, even if a structure contains a single atom having altlocs (say, A and B), this method will produce two substructures identical except for this atom.
Note
If array() does not specify any altloc ID, the method yields self.
- Parameters:
kwargs – Passed to subset().
- Returns:
An iterator over objects of the same type initialized by atoms having altloc annotations.
- Return type:
abc.Iterator[t.Self]
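The distribution rule can be illustrated with plain tuples in place of an atom array. In this sketch, split_altloc_sketch is a hypothetical helper and an empty string is assumed to mean "no altloc annotation"; the real method operates on a biotite AtomArray through subset().

```python
def split_altloc_sketch(atoms):
    """Group (atom_name, altloc_id) pairs by altloc ID.

    Atoms with an empty altloc ID are copied into every group.
    """
    altloc_ids = sorted({alt for _, alt in atoms if alt})
    if not altloc_ids:
        # Nothing to split: return everything as a single group.
        return {"": list(atoms)}
    return {
        alt_id: [a for a in atoms if a[1] in ("", alt_id)]
        for alt_id in altloc_ids
    }


atoms = [("N", ""), ("CA", "A"), ("CA", "B"), ("C", "")]
parts = split_altloc_sketch(atoms)
```

Both resulting groups share N and C and differ only in which CA atom they hold.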
- split_chains(polymer=False, **kwargs)[source]
Split into separate chains. Splitting is done using biotite.structure.get_chain_starts().
Note
Preserved ligands may have a different chain_id.
Note
If there is a single chain, this method will return self.
- subset(mask, ligands=True, reinit_ligands=False, copy=False)[source]
Create a sub-structure potentially preserving connected ligands().
Warning
If DefaultConfig["structure"]["primary_pol_type"] is set to auto, and mask points to a polymer that is shorter than some existing ligand polymer, this ligand polymer will become a primary polymer in the substructure.
- Parameters:
mask (np.ndarray) – Boolean mask, True for atoms in array(), used to create a sub-structure.
ligands (bool) – Keep ligands that are connected to atoms specified by mask.
reinit_ligands (bool) – Reinitialize ligands upon creating a sub-structure, rather than filtering existing ligands connected to atoms specified by mask. Takes precedence over the ligands option. This option is used in split_altloc().
copy (bool) – Copy the atom array resulting from subsetting the original one.
- Returns:
A new instance with atoms defined by mask and connected ligands.
- Return type:
t.Self
- superpose(other, res_id_self=None, res_id_other=None, atom_names_self=None, atom_names_other=None, mask_self=None, mask_other=None)[source]
Superpose other structure to this one. Arguments to this function all serve a single purpose: to correctly subset both structures so the resulting selections have the same number of atoms.
The subsetting is achieved either by specifying residue numbers and atom names or by supplying a binary mask of the same length as the number of atoms in the structure.
- Parameters:
other (GenericStructure | AtomArray) – Other GenericStructure or atom array.
res_id_self (Iterable[int] | None) – Residue numbers to select in this structure.
res_id_other (Iterable[int] | None) – Residue numbers to select in other structure.
atom_names_self (Iterable[Sequence[str]] | Sequence[str] | None) – Atom names to select in this structure given either per-residue or as a single sequence broadcasted to selected residues.
atom_names_other (Iterable[Sequence[str]] | Sequence[str] | None) – Same as self.
mask_self (ndarray | None) – Binary mask to select atoms. Takes precedence over other selection arguments.
mask_other (ndarray | None) – Same as self.
- Returns:
A tuple of (1) the other structure superposed onto this one, (2) the RMSD of the superposition, and (3) the transformation that was used with
biotite.structure.superimpose_apply().
- Return type:
tuple[GenericStructure, float, tuple[ndarray, ndarray, ndarray]]
- write(path, atom_marks=True, graph=True)[source]
Save this structure to a file. The format is automatically determined from the given path.
Additional files are saved using the same filename alongside the structure file. The filename will resolve to “structure” in all the following cases and result in “structure.npy” and “structure.json” files saved to the same dir:
path="/path/to/structure.pdb" path="/path/to/structure.mmtf.gz" path="/path/to/structure.with.many.dots.pdb.gz"
- Parameters:
path (PathLike | str) – A path or a path-like object compatible with
open(). Must not point to an existing directory. Must provide the structure format as an extension.
atom_marks (bool) – Save an array of atom marks in the npy format.
graph (bool) – Save molecular connectivity graph in the json format.
- Returns:
Path to the saved structure if writing was successful.
- Return type:
Path
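The naming rule for the companion files listed above can be sketched as follows. This is a minimal illustration, assuming the base name is the part of the file name before the first dot; companion_paths is a hypothetical helper and not part of lXtractor.

```python
from pathlib import Path


def companion_paths(path):
    """Return the (npy, json) paths saved next to a structure file.

    Assumption: the base name is everything before the first dot in the
    file name, which is consistent with the three documented cases.
    """
    p = Path(path)
    base = p.name.split(".")[0]
    return p.parent / f"{base}.npy", p.parent / f"{base}.json"


for example in (
    "/path/to/structure.pdb",
    "/path/to/structure.mmtf.gz",
    "/path/to/structure.with.many.dots.pdb.gz",
):
    npy_path, json_path = companion_paths(example)
    print(npy_path.name, json_path.name)
```

All three documented paths map to the same pair of companion file names under this assumption.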
- property altloc_ids: list[str]
- Returns:
A sorted list of altloc IDs. If none found, will output
[""].
- property array: AtomArray
- Returns:
Atom array object.
- property atom_marks: ndarray[tuple[int, ...], dtype[int64]]
- Returns:
An array of
lXtractor.core.config.AtomMark marks, categorizing each atom in this structure.
- property chain_ids: list[str]
- Returns:
A list of chain IDs this structure encompasses.
- property chain_ids_ligand: list[str]
- Returns:
A list of ligand chain IDs.
- property chain_ids_polymer: list[str]
- Returns:
A list of polymer chain IDs.
- property graph: PyGraph
- Returns:
A structure’s graph representation.
- property id: str
- Returns:
An identifier of this structure. It’s composed once upon initialization and has the following format:
{_name}:{polymer_chain_ids};{ligand_chain_ids}|{altloc_ids}. It should uniquely identify a structure, i.e., one should expect two structures with the same ID to be identical.
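The ID layout can be illustrated with plain string formatting. This is only a sketch: the way multiple chain or altloc IDs are joined within a field is not specified here, so the comma separator and the compose_id helper are assumptions made for illustration.

```python
def compose_id(name, polymer_chain_ids, ligand_chain_ids, altloc_ids):
    """Compose an ID following {name}:{polymers};{ligands}|{altlocs}."""
    # Assumption: several IDs within one field are joined by commas.
    pol = ",".join(polymer_chain_ids)
    lig = ",".join(ligand_chain_ids)
    alt = ",".join(altloc_ids)
    return f"{name}:{pol};{lig}|{alt}"


print(compose_id("1abc", ["A", "B"], ["C"], [""]))
```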
- property is_empty_polymer: bool
Check if there are any polymer atoms.
- Returns:
True if there are >=1 polymer atoms and False otherwise.
- property is_singleton: bool
- Returns:
True if the structure contains a single residue.
- property name: str
- Returns:
A name of the structure.
- class lXtractor.core.structure.Masks(primary_polymer: 'npt.NDArray[np.bool_]', primary_polymer_ptm: 'npt.NDArray[np.bool_]', primary_polymer_modified: 'npt.NDArray[np.bool_]', solvent: 'npt.NDArray[np.bool_]', ligand: 'npt.NDArray[np.bool_]', ligand_covalent: 'npt.NDArray[np.bool_]', ligand_poly: 'npt.NDArray[np.bool_]', ligand_nonpoly: 'npt.NDArray[np.bool_]', ligand_pep: 'npt.NDArray[np.bool_]', ligand_nuc: 'npt.NDArray[np.bool_]', ligand_carb: 'npt.NDArray[np.bool_]', unk: 'npt.NDArray[np.bool_]')[source]
Bases:
object
- __init__(primary_polymer, primary_polymer_ptm, primary_polymer_modified, solvent, ligand, ligand_covalent, ligand_poly, ligand_nonpoly, ligand_pep, ligand_nuc, ligand_carb, unk)
- ligand: ndarray[tuple[int, ...], dtype[bool]]
- ligand_carb: ndarray[tuple[int, ...], dtype[bool]]
- ligand_covalent: ndarray[tuple[int, ...], dtype[bool]]
- ligand_nonpoly: ndarray[tuple[int, ...], dtype[bool]]
- ligand_nuc: ndarray[tuple[int, ...], dtype[bool]]
- ligand_pep: ndarray[tuple[int, ...], dtype[bool]]
- ligand_poly: ndarray[tuple[int, ...], dtype[bool]]
- primary_polymer: ndarray[tuple[int, ...], dtype[bool]]
- primary_polymer_modified: ndarray[tuple[int, ...], dtype[bool]]
- primary_polymer_ptm: ndarray[tuple[int, ...], dtype[bool]]
- solvent: ndarray[tuple[int, ...], dtype[bool]]
- unk: ndarray[tuple[int, ...], dtype[bool]]
- class lXtractor.core.structure.NucleotideStructure(array, structure_id, ligands=True, atom_marks=None, graph=None)[source]
Bases:
GenericStructure
A structure type where the primary polymer is a nucleotide.
See also
GenericStructure for general-purpose documentation.
- class lXtractor.core.structure.ProteinStructure(array, structure_id, ligands=True, atom_marks=None, graph=None)[source]
Bases:
GenericStructure
A structure type where the primary polymer is a peptide.
See also
GenericStructure for general-purpose documentation.
- lXtractor.core.structure.mark_atoms(structure)[source]
Mark each atom in structure according to
lXtractor.core.config.AtomMark.
This function is used upon initializing
GenericStructure and its subclasses, storing the output under GenericStructure.atom_marks.
- Parameters:
structure (GenericStructure) – An arbitrary structure.
- Returns:
An array of atom marks (equivalently, classes or types).
- Return type:
tuple[ndarray[tuple[int, …], dtype[int64]], list[Ligand]]
- lXtractor.core.structure.mark_atoms_g(s, single_poly_chain=False)[source]
Mark structure atoms based on a molecular graph representation, using the
lXtractor.core.config.AtomMark categories.
Atoms are classified into five categories:
1. primary polymer: corresponds to the PEP, NUC or CARB categories.
2. solvent: SOLVENT.
3. non-polymer ligand: LIGAND.
4. polymer ligand: a combination of LIGAND with one of the primary polymer types, e.g., AtomMark.LIGAND | AtomMark.NUC.
5. unknown: UNK for atoms that couldn’t be categorized.
The classification process depends on groups of atoms forming covalent bonds with each other, i.e., connected components in the molecular graph representation. Each such component is assessed separately and its atoms are classified as polymer, ligand, or solvent. If the primary polymer is set to “auto” in config (
DefaultConfig["structure"]["primary_pol_type"]), the polymer with the largest number of monomers will be selected. The rest of the polymers will become polymer ligands: a special kind of ligand that can have multiple residues. See lXtractor.core.ligand.Ligand for details.
- Parameters:
s (GenericStructure)
single_poly_chain (bool)
- Returns:
- Return type:
(npt.NDArray[np.int_], str, list[Ligand])
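The combined marks and the “auto” selection of the primary polymer can be sketched with a standard-library flag enum. The Mark class below is a hypothetical stand-in for lXtractor.core.config.AtomMark: its numeric values are assumptions, and chains are used in place of connected components purely for brevity.

```python
from enum import IntFlag


class Mark(IntFlag):
    # Hypothetical stand-in for AtomMark; the values are assumptions.
    UNK = 1
    SOLVENT = 2
    LIGAND = 4
    PEP = 8
    NUC = 16
    CARB = 32


def pick_primary(monomer_counts):
    """Pick the polymer with the largest number of monomers ("auto")."""
    return max(monomer_counts, key=monomer_counts.get)


# Toy input: polymer label -> number of monomers.
counts = {"A": 250, "B": 12}
primary = pick_primary(counts)

# The primary polymer keeps the plain polymer mark; the remaining
# polymers become polymer ligands, i.e. LIGAND combined with the type.
marks = {
    label: Mark.PEP if label == primary else Mark.LIGAND | Mark.PEP
    for label in counts
}

for label, mark in marks.items():
    print(label, bool(mark & Mark.LIGAND), bool(mark & Mark.PEP))
```

Testing a mark with the bitwise “and” operator, as in the last loop, is the usual way to ask whether a combined flag includes a given category.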
lXtractor.core.ligand module
- class lXtractor.core.ligand.Ligand(parent, mask, contact_mask, ligand_idx, dist, meta=None)[source]
Bases:
object
A Ligand object is a part of the structure falling under certain criteria.
Namely, a ligand is a non-polymer and non-solvent molecule or a single monomer. Such ligands will be designated using the format:
{res_name}_{res_id}:{chain_id}<-({parent})
If a ligand contains multiple monomers, by convention, this is a polymer ligand. Such ligands should be named using the first letter of the polymer type: one of
("p", "n", "c"). In this case, its ID will be of the following format:
{polymer_type}_{min_res_id}-{max_res_id}:{chain_id}<-({parent})
This information is provided by
meta and shouldn’t be changed. However, any additional fields can be stored in meta, which will be retrieved when constructing summary().
Attributes
mask and contact_mask are boolean masks allowing one to obtain ligand and ligand-contacting atoms from parent.
See also
make_ligand() to initialize a new ligand in an easy way.
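The two ID formats described above can be illustrated with a small formatting helper. This is a sketch only: ligand_id and its arguments are hypothetical and do not mirror the actual Ligand constructor.

```python
def ligand_id(parent, chain_id, res_ids, res_name=None, polymer_type=None):
    """Format a ligand ID for a single-residue or a polymer ligand."""
    if polymer_type is None:
        # Single residue: {res_name}_{res_id}:{chain_id}<-({parent})
        return f"{res_name}_{res_ids[0]}:{chain_id}<-({parent})"
    # Polymer ligand; polymer_type is one of "p", "n", "c":
    # {polymer_type}_{min_res_id}-{max_res_id}:{chain_id}<-({parent})
    return (
        f"{polymer_type}_{min(res_ids)}-{max(res_ids)}"
        f":{chain_id}<-({parent})"
    )


print(ligand_id("parent", "A", [301], res_name="ATP"))
print(ligand_id("parent", "B", [3, 1, 2], polymer_type="p"))
```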
- is_locally_connected(mask)[source]
Check whether this ligand is connected to a subset of parent atoms.
- Parameters:
mask (ndarray) – A boolean mask to filter parent atoms.
- Returns:
True if the ligand has at least min_atom_connections to parent substructure imposed by the provided mask.
- Return type:
bool
- property chain_id: str
- Returns:
Ligand chain ID.
- contact_mask: np.ndarray
A boolean mask such that when applied to the parent, subsets the latter to its ligand-contacting atoms.
- dist
An array of distances for each ligand-contacting parent’s atom.
- property id: str
- is_polymer
- ligand_idx
An integer array with indices pointing to ligand atoms contacting the parent structure.
- mask
A boolean mask such that when applied to the parent, subsets the latter to the ligand residues.
- meta
A dictionary of meta info.
- parent: GenericStructure
Parent structure.
- property parent_contact_atoms: AtomArray
- Returns:
An array of ligand-contacting atoms within
parent.
- property parent_contact_chains: set[str]
- Returns:
A set of chain IDs involved in forming contacts with ligand.
- property res_id: str
- Returns:
Ligand residue number.
- property res_name: str
- Returns:
Ligand residue name.
- lXtractor.core.ligand.ligands_from_atom_marks(structure)[source]
- Return type:
abc.Generator[Ligand, None, None]
- lXtractor.core.ligand.make_ligand(m_lig, m_pol, structure)[source]
Create a new
Ligand object. The criteria to qualify for a ligand are defined by the global config (DefaultConfig["ligand"]).
Whether a ligand molecule is created is subject to several checks:
1. It has a certain number of atoms.
2. It has a certain number of contacts with the polymer.
3. It contacts a certain number of residues in the polymer.
4. Its atoms span a single chain.
If a ligand fails any of these checks, the function returns
None.
- Parameters:
m_lig (npt.NDArray[np.bool_]) – A boolean mask pointing to putative ligand atoms.
m_pol (npt.NDArray[np.bool_]) – A boolean mask pointing to polymer atoms that supposedly contact ligand atoms.
structure (GenericStructure) – A parent structure to which the masks can be applied.
- Returns:
An instantiated ligand or
None if the checks were not passed.
- Return type:
Ligand | None
lXtractor.core.pocket module
The module defines Pocket, representing an arbitrarily defined
binding pocket.
- class lXtractor.core.pocket.Pocket(definition, name='Pocket')[source]
Bases:
object
A binding pocket.
The pocket is defined via a single string following a particular syntax (a definition), such that, when applied to a ligand using
is_connected(), the latter outputs True if the ligand is connected. Consequently, it is tightly bound to lXtractor.core.ligand.Ligand. Namely, the definition relies on two matrices:
“c” = lXtractor.core.ligand.Ligand.contact_mask (boolean mask)
“d” = lXtractor.core.ligand.Ligand.dist (distances)
The definition is a combination of statements. Each statement involves the selection consisting of a matrix (“c” or “d”), residue positions, and residue atom names, formatted as:
{matrix-prefix}:[pos]:[atom_names] {sign} {number}
where
[pos] and [atom_names] can be comma-separated lists, sign is a comparison operator, and number (int or float) is what to compare to. For instance, the selection c:1:CA,CB == 2 translates into “must have exactly two contacts with atoms “CA” and “CB” at position 1”. See more examples below.
Comparison meaning depends on the matrix type used: “c” or “d”.
In the former case,
>= x means “at least x contacts”. In the latter case, “<= x” means “have distance below x”.
In the case of the “d” matrix, applying selection and comparison will result in a vector of
bool values, requiring an aggregation. Two aggregation types are supported: “da” (any) and “daa” (all).
In the case of the “c” matrix, possible matrix prefixes are “c” and “cs”. They have very different meanings! In the former case, the statement compares the total number of contacts when the selection is applied. In the latter case, the statement will select residues separately and, for each residue, decide whether the selected atoms form enough contacts to extrapolate towards the full residue and mark it as “contacting” (controlled via min_contacts). These decisions are summed across residues and this sum is compared to the number in the statement. See the example below.
Finally, statements can be bracketed and combined by boolean operators “AND” and “OR” (which one can abbreviate by “&” and “|”).
Examples:
At least two contacts with any atom of residues 1 and 5:
c:1,5:any >= 2
Note that the above is a “cumulative” statement, i.e., it is applied to both residues at the same time. Thus, if residue 1 has two atoms contacting a ligand while residue 5 has none, this will still evaluate to
True. The following definition will ensure that each residue has at least two contacts:
c:1:any >= 2 & c:5:any >= 2
In contrast, the following statement translates to “among residues 1, 2, and 3, there are at least two “contacting” residues”:
cs:1,2,3:any >= 2
Any atoms farther than 10A from alpha-carbons of positions 1 and 10:
da:1,10:CA > 10
Any atoms with at least two contacts with any atoms at position 1 or all CA atoms closer than 6A of positions 2 and 3:
c:1:any >= 2 | daa:2,3:CA < 6
CA or CB atoms with a contact at position 1 but not 2, while position 3 has any atoms below 10A threshold:
c:1:CA,CB >= 1 & c:2:CA,CB == 0 & da:3:any <= 10
Contact with positions 1 and 2 or positions 3 and 4:
(c:1:any >= 1 & c:2:any >= 1) | (c:3:any >= 1 & c:4:any >= 1)
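The difference between the cumulative “c” prefix and the per-residue “cs” prefix can be sketched in plain Python. The toy contact data and the helper functions c and cs below are hypothetical; they mimic the semantics described above and are not the library’s implementation.

```python
# Hypothetical toy data: for each residue position, one boolean per atom
# telling whether that atom contacts the ligand.
contacts = {
    1: [True, True, False],
    2: [False, False, False],
    3: [True, False, False],
    5: [False, False],
}


def c(positions):
    """'c' prefix: total number of contacting atoms in the selection."""
    return sum(sum(contacts[p]) for p in positions)


def cs(positions, min_contacts=1):
    """'cs' prefix: number of residues with >= min_contacts contacts."""
    return sum(sum(contacts[p]) >= min_contacts for p in positions)


cumulative = c([1, 5]) >= 2                   # mirrors c:1,5:any >= 2
per_residue = c([1]) >= 2 and c([5]) >= 2     # each residue checked alone
contacting = cs([1, 2, 3]) >= 2               # mirrors cs:1,2,3:any >= 2

print(cumulative, per_residue, contacting)
```

With this data, the cumulative statement holds because residue 1 alone supplies two contacts, whereas requiring two contacts from each residue separately does not hold.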
See also
- is_connected(ligand, mapping=None, **kwargs)[source]
Check whether a ligand is connected.
- Parameters:
ligand (Ligand) – An arbitrary ligand.
mapping (dict[int, int] | None) – A mapping to the ligand’s parent structure numbering.
kwargs – Passed to
translate_definition().
- Returns:
True if the ligand is bound within the pocket and False otherwise.
- Return type:
bool
- definition
- name
- lXtractor.core.pocket.make_sel(pos, atoms)[source]
Make a selection string from positions and atoms.
>>> make_sel(1, 'any')
'(a.res_id == 1)'
>>> make_sel([1, 2], 'CA,CB')
"np.isin(a.res_id, [1, 2]) & np.isin(a.atom_name, ['CA', 'CB'])"
- Parameters:
pos (int | Sequence[int])
atoms (str)
- Returns:
- Return type:
str
- lXtractor.core.pocket.translate_definition(definition, mapping=None, *, skip_unmapped=False, min_contacts=1)[source]
Translates the
Pocket.definition into a series of statements, such that, when applied to ligand matrices, they evaluate to bool.
>>> translate_definition("c:1:any > 1")
'(c[np.isin(a.res_id, [1])].sum() > 1)'
>>> translate_definition("da:1,2:CA,CZ <= 6")
"(d[np.isin(a.res_id, [1, 2]) & np.isin(a.atom_name, ['CA', 'CZ'])] <= 6).any()"
>>> translate_definition("daa:1,2:any > 2", {1: 10}, skip_unmapped=True)
'(d[np.isin(a.res_id, [10])] > 2).all()'
>>> translate_definition("cs:1,2:any > 2")
'sum([c[(a.res_id == 1)].sum() >= 1, c[(a.res_id == 2)].sum() >= 1]) > 2'
Warning
skip_unmapped=True may change the pocket’s definition and lead to undesired conclusions. Caution advised!
- Parameters:
definition (str) – A string definition of a
Pocket.
mapping (dict[int, int] | None) – An optional mapping from the definition’s numbering to a structure’s numbering.
skip_unmapped (bool) – If the mapping is provided and some position is left unmapped, skip this position.
min_contacts (int) – If prefix is “cs”, use this threshold to determine a minimum number of residue contacts required to consider it bound.
- Returns:
A new string with statements of the provided definition translated into a numpy syntax.
- Return type:
str
lXtractor.core.interface module
The module defines Interface, representing an interface between
two partners in a molecular structure.
- class lXtractor.core.interface.AtomNode(idx, atom, has_edge=False)[source]
Bases:
object
Represents an atom node in the interface graph.
- __init__(idx, atom, has_edge=False)
- atom: Atom
The Atom object of biotite.
- has_edge: bool = False
Indicates if the atom is involved in any contacts
- idx: int
Index of the atom in the parent structure
- class lXtractor.core.interface.ContactEdge(atom_a, atom_b, dist)[source]
Bases:
object
Represents a contact edge in the interface graph.
By convention, in an edge
(i, j), i belongs to partner chains “a”, whereas j belongs to partner chains “b”.
- __init__(atom_a, atom_b, dist)
- classmethod from_node_indices(i, j, g)[source]
Create a ContactEdge from node indices in the graph.
- Parameters:
i (int) – Index of the first atom node.
j (int) – Index of the second atom node.
g (rx.PyGraph) – The interface graph.
- Returns:
A new ContactEdge instance.
- Return type:
t.Self
- dist: float
Distance between the two atoms
- class lXtractor.core.interface.Interface(parent_structure, partners, subset_parent_to_partners=True, cutoff=6.0, graph=None)[source]
Bases:
object
An asymmetric interface between two partners in a molecular structure.
The interface is defined by two distinct sets of partner chains (typically protein), designated as “a” and “b”, and their interactions. The interface is constructed using a graph representation where nodes are atoms and edges represent contacts between atoms from different partners. For a given edge
(i, j), i and j belong to the “a” and “b” chain groups, respectively. A spatial tree (KD-tree) is used to efficiently compute these contacts within a specified cutoff distance.
The class provides methods to analyze the interface, including:
Retrieving contact atoms and indices for each partner
Counting contacts and interacting residues
Calculating SASA and BSA for each partner and the complex
Splitting the interface into connected components
Two Interfaces are considered equal if they have the same partners and their graphs are identical (same nodes and edges).
- Parameters:
cutoff (float) – The maximum distance (in Angstroms) for considering two atoms to be in contact.
partners (_Partners) – Two tuples of chain identifiers representing the two sides of the interface (“a” and “b” partners).
is_subset – Indicates if the interface is a subset of the parent structure.
Note
The asymmetric nature of the interface means that methods and properties often have separate versions for “a” and “b” partners (e.g., mask_a and mask_b).
See also
AtomNode
A node of the interface graph.
ContactEdge
An edge of the interface graph.
Warning
The
parent_structure() is considered immutable, while the G() can only change edges and their properties; the atom nodes must stay the same.
- __init__(parent_structure, partners, subset_parent_to_partners=True, cutoff=6.0, graph=None)[source]
Initialize the Interface object.
- Parameters:
parent_structure (GenericStructure) – The parent structure containing the interface.
partners (tuple[Sequence[str] | str, Sequence[str] | str] | str) – Chain identifiers for the two sides of the interface. Can be a string “A_B” or a tuple of two sequences of chain ids.
subset_parent_to_partners (bool) – If
True, subset the parent structure to only include atoms from the specified partners.
cutoff (float) – The maximum distance for considering two atoms to be in contact.
graph (PyGraph | None) – A pre-computed graph representing the interface contacts. If None, a new graph will be created.
- Raises:
AssertionError – If the cutoff is not greater than zero.
MissingData – If either set of partners is empty.
AmbiguousData – If the two sets of partners overlap.
LengthMismatch – If the provided graph doesn’t match the structure.
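The notion of a contact edge between partners “a” and “b” can be sketched without any dependencies. The code below uses a brute-force scan over all atom pairs in place of the KD-tree, and treats a pair as a contact when its distance is at or below the cutoff (whether the boundary is inclusive is an assumption). The contact_edges helper is hypothetical, and its indices are local to each coordinate list rather than indices into a parent structure.

```python
import math
from itertools import product


def contact_edges(coords_a, coords_b, cutoff=6.0):
    """Return (i, j, dist) for atom pairs within the cutoff.

    i indexes atoms of partner "a" and j indexes atoms of partner "b",
    following the (i, j) edge convention.
    """
    if cutoff <= 0:
        raise ValueError("cutoff must be greater than zero")
    edges = []
    for (i, xyz_a), (j, xyz_b) in product(
        enumerate(coords_a), enumerate(coords_b)
    ):
        dist = math.dist(xyz_a, xyz_b)
        if dist <= cutoff:
            edges.append((i, j, dist))
    return edges


partner_a = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
partner_b = [(3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
edges = contact_edges(partner_a, partner_b)
print(edges)
```

A spatial tree gives the same set of edges while avoiding the quadratic scan, which matters for structures with many atoms.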
- count_contact_atoms(chain_ids=None, strict=False)[source]
Count the number of atoms involved in contacts. Equivalent to the total number of nodes connected with edges in
G().
- Parameters:
chain_ids (Sequence[str] | str | None) – Optional; count contact atoms involving the provided chains.
strict (bool) – Used only if chain_ids is provided. If
True, will filter to atoms from provided chain_ids. Otherwise, will count all atoms making contacts with specified chain_ids.
- Returns:
The number of unique atoms involved in contacts.
- Return type:
int
- count_contact_residues(chain_ids=None, strict=False)[source]
Count the number of residues involved in contacts.
- Parameters:
chain_ids (Sequence[str] | str | None) – Optional; include results only for the provided chains.
strict (bool) – Used only if chain_ids is provided. If
True, will filter to residues from provided chain_ids. Otherwise, will count all residues making contacts with specified chain_ids.
- Returns:
The number of unique residues involved in contacts.
- Return type:
int
- count_contacts(chain_ids=None)[source]
Count the number of contacts in the interface. Equivalent to the number of edges in
G().
- Parameters:
chain_ids (Sequence[str] | str | None) – Optional; count contacts involving the provided chains.
- Returns:
The number of atom-atom contacts in the interface
- Return type:
int
- get_contact_atoms(chain_ids=None)[source]
Get the contacting atoms from both partners.
- Parameters:
chain_ids (Sequence[str] | str | None) – Optional; include results only for the provided chains.
- Returns:
A tuple of two AtomArrays of equal sizes containing the contacting atoms from partners “a” and “b” respectively.
- Return type:
tuple[AtomArray, AtomArray]
- get_contact_atoms_mask(chain_ids=None)[source]
Get a mask pointing to contact atoms.
- Parameters:
chain_ids (Sequence[str] | str | None) – Contacts must involve the provided chains.
- Returns:
An array where
True points to atoms involved in interface contacts.
- Return type:
ndarray[tuple[int, …], dtype[bool]]
- get_contact_idx(chain_ids=None)[source]
Get the indices of contacting atom pairs.
- Parameters:
chain_ids (Sequence[str] | str | None) – Optional list of chain IDs to filter the contacts.
- Returns:
A numpy array of shape (N, 2) where each row contains the indices of a contacting atom pair.
- Return type:
ndarray[tuple[int, …], dtype[int]]
- get_contact_idx_a(chain_ids=None)[source]
- Parameters:
chain_ids (Sequence[str] | str | None) – Optional; contacts must involve the provided chains.
- Returns:
A numpy array of indices of contacting atoms from partner “a”.
- Return type:
ndarray[tuple[int, …], dtype[int]]
- get_contact_idx_b(chain_ids=None)[source]
- Parameters:
chain_ids (Sequence[str] | str | None) – Optional; contacts must involve the provided chains.
- Returns:
A numpy array of indices of contacting atoms from partner “b”.
- Return type:
ndarray[tuple[int, …], dtype[int]]
- iter_ccs(as_='nodes', min_nodes=2)[source]
Iterate over the connected components of the interface graph.
- Parameters:
as_ –
The format to return the connected components. Options are:
“nodes” (default): List of node indices
“subgraph”: Subgraph of the interface graph
“edges”: List of edge indices
“atoms”: AtomArray of the atoms in the connected component
min_nodes (int) – Minimum number of nodes for a connected component to be included. Should be
>=2.
- Returns:
An iterator over the connected components in the specified format.
- Return type:
Iterator[list[int]] | Iterator[PyGraph] | Iterator[AtomArray]
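Iterating over connected components with a minimum size can be sketched with a breadth-first search over an edge list. This is a dependency-free illustration of the “nodes” output format only; iter_components is a hypothetical helper and does not use the interface graph object itself.

```python
from collections import defaultdict, deque


def iter_components(edges, min_nodes=2):
    """Yield sorted node lists of components with >= min_nodes nodes."""
    adjacency = defaultdict(set)
    for i, j in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)

    seen = set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            # Visit neighbors that have not been reached yet.
            for neighbor in adjacency[node] - seen:
                seen.add(neighbor)
                queue.append(neighbor)
        if len(component) >= min_nodes:
            yield sorted(component)


contact_pairs = [(0, 5), (5, 1), (2, 6)]
print(list(iter_components(contact_pairs)))
print(list(iter_components(contact_pairs, min_nodes=3)))
```

Because every node in such a graph comes from at least one edge, each component has two or more nodes, which is consistent with the requirement that min_nodes be >=2.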
- classmethod read(path)[source]
Read an Interface object from a file.
- Parameters:
path (PathLike | str) – Path to the directory containing the interface files.
- Returns:
An Interface object.
- Raises:
FileNotFoundError – If required files are missing.
MissingData – If required metadata is missing.
- Return type:
t.Self
- sasa(mask=None, canonical=True)[source]
Calculate the Solvent Accessible Surface Area (SASA) for the interface. See
InterfaceSASA for more details.
- Parameters:
mask (ndarray[tuple[int, ...], dtype[bool]] | None) – A custom atom mask of
parent_structure() pointing to atoms to include in calculation.
canonical (bool) – Use only atoms of canonical residues for calculating SASA. In some cases, this may save from unexpected exceptions that happen due to biotite missing some atoms in non-canonical residues that it expects to be there (atomic radii are required for each atom for SASA calculation).
- Returns:
An InterfaceSASA object containing SASA values for partners “a” and “b” individually and in complex.
- Return type:
InterfaceSASA
- split_connected(condition_a=RetainCondition(min_atoms=1, min_res=1, selector=None), condition_b=RetainCondition(min_atoms=1, min_res=1, selector=None), conditions_op=<built-in function and_>, conditions_apply_to='chains', into_pairs=False, subset_parent_to_partners=True, cutoff=6.0)[source]
Split the interface into connected components based on specified conditions.
This method allows for sophisticated filtering and splitting of the interface based on user-defined conditions. It can be used to identify specific sub-interfaces or to analyze the interface at different levels of granularity.
- Parameters:
condition_a (abc.Callable[[bst.AtomArray], bool]) – Conditions for filtering partner “a” components. Can be any callable accepting an arbitrary atom array corresponding to an interface “side” and returning a boolean.
condition_b (abc.Callable[[bst.AtomArray], bool]) – Conditions for filtering partner “b” components.
conditions_op (abc.Callable[[bool, bool], bool]) – Operator to combine condition_a and condition_b. Default is operator.and_ (both conditions must be met).
conditions_apply_to (str) – Whether to apply conditions to “chains” (default) or individual connected components.
into_pairs (bool) – If True, split into pairwise interfaces between individual chains.
subset_parent_to_partners (bool) – If True, subset the parent structure in the resulting interfaces.
cutoff (float) – Distance cutoff for contacts in the resulting interfaces.
- Returns:
An iterator of
Interface objects representing the split components.
- Return type:
abc.Iterator[t.Self]
Note
The RetainCondition objects allow for flexible filtering based on number of atoms, residues, or custom selectors. This enables complex splitting strategies, such as retaining only interfaces with a minimum number of interacting residues or specific types of interactions. The default retain condition is to have at least one contact residue.
See also
RetainCondition for details on how to specify filtering conditions.
- write(base_dir, overwrite=False, name=None, str_fmt='mmtf.gz', additional_meta=None)[source]
Write the Interface object to files.
- Parameters:
base_dir (PathLike | str) – Base directory to write the files.
overwrite (bool) – If True, overwrite existing files.
name (str | None) – Name for the interface directory (default is the interface ID).
str_fmt (str) – Format for writing the structure file.
additional_meta (dict[str, t.Any] | True | None) – Additional metadata to include in the JSON file. If
dict, adds it to the default metadata records. If True, includes InterfaceSASA.
- Returns:
Path to the destination directory.
- Return type:
Path
- property G: PyGraph
Get the graph representation of the interface contacts.
- Returns:
A graph where nodes represent atoms and edges represent contacts between atoms from different partners.
- property id: str
Get the unique identifier for this Interface.
- Returns:
A string representing the interface in the format “Interface(partners)<-(parent_structure)”.
- property mask_a: ndarray[tuple[int, ...], dtype[bool]]
- Returns:
A numpy array of booleans,
True for atoms in the first partner group.
- property mask_b: ndarray[tuple[int, ...], dtype[bool]]
- Returns:
A numpy array of booleans,
True for atoms in the second partner group.
- property parent_structure: GenericStructure
Get the parent structure of the interface.
- Returns:
The parent structure containing this interface.
- property partners_a: tuple[str, ...]
- Returns:
A tuple of chain identifiers for the first partner group.
- property partners_b: tuple[str, ...]
- Returns:
A tuple of chain identifiers for the second partner group.
- property partners_fmt: str
Get a formatted string representation of the partners.
- Returns:
A string in the format “A_B” where A and B are the concatenated chain identifiers of each partner.
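The “A_B” convention can be sketched as a pair of helpers. Both functions are hypothetical; the parser additionally assumes single-character chain IDs, since concatenated identifiers cannot be split unambiguously otherwise.

```python
def format_partners(partners_a, partners_b):
    """Concatenate chain IDs of each side and join the sides with "_"."""
    return f"{''.join(partners_a)}_{''.join(partners_b)}"


def parse_partners(spec):
    """Split an "A_B"-style string into two tuples of chain IDs.

    Assumption: every chain ID is a single character.
    """
    side_a, side_b = spec.split("_")
    return tuple(side_a), tuple(side_b)


print(format_partners(("A", "B"), ("C",)))
print(parse_partners("AB_C"))
```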
- property partners_joined: list[str]
- Returns:
A list of all chain identifiers from both partners.
- class lXtractor.core.interface.InterfaceComparator(state_ref, state_mob, superpose_by='a', min_spp_atoms=5)[source]
Bases:
object
A class for comparing interfaces corresponding to different states of the same binding partners. It assumes that parent structures of these states have the same atoms in the same order but perhaps with different coordinates. To check if the interfaces are comparable, one may use
are_comparable() before initializing.
It superposes the parent structure of
state_mob over state_ref during initialization. Then, common metrics such as irmsd(), lrmsd() and dockq() can be computed fast and reliably.
- __init__(state_ref, state_mob, superpose_by='a', min_spp_atoms=5)[source]
- Parameters:
state_ref (Interface) – A reference state of the interface.
state_mob (Interface) – A mobile state of the interface. A copy of its structure after superposition is stored upon init and can be accessed via
superposed_mob().
superpose_by (str | ndarray) – Defines which set of atoms is used to superpose the mobile state over the fixed one. Can be
"a" or "b" to indicate the corresponding binding partners or a str with “,”-separated chains. Can also be a numpy array with atom indices or a boolean mask pointing to atoms to use for superposition.
min_spp_atoms (int) – Minimum number of atoms necessary to superpose structures after superpose_by is applied.
- Raises:
AmbiguousData – if interfaces are not comparable.
Note
A
str type of superpose_by is used to filter the interface contacts to specified chains. For instance, setting it to "A" will result in the selection of contact atoms involving chain A as opposed to using only chain A contacts for superposition. Hence, values "a" and "b" are essentially equivalent since they’ll result in the same selection of contact atoms: those involved in the interface formation.
- classmethod are_comparable(state1, state2)[source]
A method to check whether two interface states are comparable to be used in this class.
- dockq(d1=8.5, d2=1.5)[source]
A DockQ score from [Basu and Wallner, 2016].
- Parameters:
- Returns:
DockQ score ranging from 0 (no match) to 1 (perfect match).
- Return type:
float
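For orientation, the published DockQ definition averages the fraction of native contacts (Fnat) with scaled LRMSD and iRMSD terms, where d1 scales the LRMSD and d2 scales the iRMSD. The sketch below follows that published formula; the function names are illustrative, and how Fnat and the RMSD values are obtained inside dockq() is not shown here.

```python
def rms_scaled(rms, d):
    """Map an RMSD onto (0, 1]; equals 1 at rms=0 and 0.5 at rms=d."""
    return 1.0 / (1.0 + (rms / d) ** 2)


def dockq_score(fnat, irmsd, lrmsd, d1=8.5, d2=1.5):
    """Average of Fnat, scaled LRMSD (d1) and scaled iRMSD (d2)."""
    return (fnat + rms_scaled(lrmsd, d1) + rms_scaled(irmsd, d2)) / 3.0


# A perfect model: all native contacts recovered and zero deviations.
print(dockq_score(fnat=1.0, irmsd=0.0, lrmsd=0.0))
```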
[1] Sankar Basu and Björn Wallner. Dockq: a quality measure for protein-protein docking models. PLOS ONE, 11(8):1–9, 08 2016. URL: https://doi.org/10.1371/journal.pone.0161879, doi:10.1371/journal.pone.0161879.
[2] Gerard JP van Westen, Remco F Swier, Jörg K Wegner, Adriaan P IJzerman, Herman WT van Vlijmen, and Andreas Bender. Benchmarking of protein descriptor sets in proteochemometric modeling (part 1): comparative study of 13 amino acid descriptor sets. Journal of Cheminformatics, 5(1):41–41, 2013. doi:10.1186/1758-2946-5-41.
[3] Gerard JP van Westen, Remco F Swier, Isidro Cortes-Ciriano, Jörg K Wegner, John P Overington, Adriaan P IJzerman, Herman WT van Vlijmen, and Andreas Bender. Benchmarking of protein descriptor sets in proteochemometric modeling (part 2): modeling performance of 13 amino acid descriptor sets. Journal of Cheminformatics, 5(1):42, 2013. doi:10.1186/1758-2946-5-42.
- irmsd()[source]
Compute interface RMSD.
- Returns:
RMSD computed over atoms comprising interface of the
state_ref.
- Return type:
float
- lrmsd(ligand_chains='b')[source]
Compute “ligand” RMSD.
- Parameters:
ligand_chains (str | Sequence[str]) – Specification of which chains to consider “ligand”. By default, this points to
Interface.partners_b.
- Returns:
RMSD computed over chains posing as “ligand”.
- Return type:
float
- rmsd_over(atom_mask)[source]
A general-purpose method to compute RMSD between reference and mobile states over arbitrary set of atoms.
- Parameters:
atom_mask (ndarray[tuple[int, ...], dtype[bool]]) – A boolean mask where
True indicates target atoms.
- Returns:
RMSD over target atoms.
- Return type:
float
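Computing an RMSD over a masked subset of atoms can be sketched in a few lines. The example assumes the two coordinate lists are already superposed and list the same atoms in the same order; rmsd_masked is a hypothetical standalone function, not the method itself.

```python
import math


def rmsd_masked(ref_coords, mob_coords, atom_mask):
    """RMSD between two coordinate lists over atoms where mask is True."""
    pairs = [
        (ref, mob)
        for ref, mob, keep in zip(ref_coords, mob_coords, atom_mask)
        if keep
    ]
    if not pairs:
        raise ValueError("the mask selects no atoms")
    squared = sum(math.dist(ref, mob) ** 2 for ref, mob in pairs)
    return math.sqrt(squared / len(pairs))


ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
mob = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 0.0, 0.0)]
print(rmsd_masked(ref, mob, [True, True, False]))
```

The third atom is far apart in the two states but does not affect the result because the mask excludes it.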
- state_mob
Mobile interface state.
- state_ref
Reference interface state.
- superpose_by
Superpose selection specifications.
- property superposed_mob: GenericStructure
- Returns:
A copy of mobile structure with coordinates transformed following superpositions.
- class lXtractor.core.interface.InterfaceSASA(a_free, b_free, a_complex, b_complex, complex)[source]
Bases:
object
Stores Solvent Accessible Surface Area (SASA) values for an interface.
- __init__(a_free, b_free, a_complex, b_complex, complex)
- as_record()[source]
- Returns:
SASA and BSA values as a dictionary.
- Return type:
dict[str, float]
- a_complex: float
SASA of partner “a” in the complex.
- a_free: float
SASA of partner “a” alone.
- b_complex: float
SASA of partner “b” in the complex.
- b_free: float
SASA of partner “b” alone.
- property bsa_a: float
- property bsa_b: float
- property bsa_complex: float
- Returns:
Total buried surface area. Computed as a difference between the sum of the free a and b and the complex SASAs.
- complex: float
SASA of the entire complex.
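The relation between the stored SASA values and the buried surface area (BSA) properties can be sketched with a small dataclass. The bsa_complex formula follows the description above; the per-partner formulas (free minus in-complex SASA) are an assumption, and SasaSketch is a hypothetical stand-in for InterfaceSASA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SasaSketch:
    a_free: float
    b_free: float
    a_complex: float
    b_complex: float
    complex: float

    @property
    def bsa_a(self):
        # Assumption: surface of "a" that becomes buried in the complex.
        return self.a_free - self.a_complex

    @property
    def bsa_b(self):
        # Assumption: surface of "b" that becomes buried in the complex.
        return self.b_free - self.b_complex

    @property
    def bsa_complex(self):
        # As documented: sum of the free SASAs minus the complex SASA.
        return self.a_free + self.b_free - self.complex


sasa = SasaSketch(
    a_free=100.0, b_free=80.0, a_complex=90.0, b_complex=72.0, complex=162.0
)
print(sasa.bsa_a, sasa.bsa_b, sasa.bsa_complex)
```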
- class lXtractor.core.interface.RetainCondition(min_atoms=1, min_res=1, selector=None)[source]
Bases:
object
Defines conditions for retaining an interface component.
This condition is applied to interface parts of either a or b partners in
Interface.split_connected() and then combined into a decision whether this interface part is to be retained upon splitting.
- __init__(min_atoms=1, min_res=1, selector=None)
- apply(a, return_counts=False)[source]
Apply the retention condition to an atom array.
- Parameters:
a (AtomArray) – The atom array to check.
return_counts (bool) – If True, return atom and residue counts instead of a boolean.
- Returns:
Boolean indicating if the condition is met, or atom and residue counts if return_counts is True.
- Return type:
bool | tuple[int, int]
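The behavior of apply() can be sketched on a simplified atom representation. Here atoms are hypothetical (chain_id, res_id, atom_name) tuples rather than a biotite AtomArray, the selector is assumed to run before counting, and RetainSketch is only a stand-in for RetainCondition.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RetainSketch:
    min_atoms: int = 1
    min_res: int = 1
    selector: Optional[Callable[[list], list]] = None

    def apply(self, atoms, return_counts=False):
        # Assumption: the optional selector narrows atoms before counting.
        if self.selector is not None:
            atoms = self.selector(atoms)
        n_atoms = len(atoms)
        n_res = len({(chain, res) for chain, res, _ in atoms})
        if return_counts:
            return n_atoms, n_res
        return n_atoms >= self.min_atoms and n_res >= self.min_res


side = [("A", 1, "CA"), ("A", 1, "CB"), ("A", 2, "CA")]


def only_ca(atoms):
    """Keep alpha-carbon atoms only."""
    return [atom for atom in atoms if atom[2] == "CA"]


print(RetainSketch(min_atoms=3, min_res=2).apply(side))
print(RetainSketch(selector=only_ca).apply(side, return_counts=True))
```

Two such conditions, one per interface side, can then be combined with an operator such as operator.and_ to decide whether a component is retained.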
- min_atoms: int = 1
Minimum number of atoms required
- min_res: int = 1
Minimum number of residues required
- selector: Callable[[AtomArray], AtomArray] | None = None
Optional function to select specific atoms