lXtractor.core package

lXtractor.core.alignment module

A module handling multiple sequence alignments.

class lXtractor.core.alignment.Alignment(seqs, add_method=<function mafft_add>, align_method=<function mafft_align>)[source]

Bases: object

An MSA resource: a collection of aligned sequences.

__init__(seqs, add_method=<function mafft_add>, align_method=<function mafft_align>)[source]
Parameters:
  • seqs (Iterable[tuple[str, str]]) – An iterable with (id, _seq) pairs.

  • add_method (AddMethod) – A callable adding sequences. Check the type for a signature.

  • align_method (AlignMethod) – A callable aligning sequences.

add(other)[source]

Add sequences to existing ones using add(). This is similar to align() but automatically adds the aligned seqs.

>>> a = Alignment([('A', 'ABCD'), ('X', 'XXXX')])
>>> aa = a.add(('Y', 'ABXD'))
>>> aa.shape
(3, 4)
Parameters:

other (abc.Iterable[_ST] | _ST | Alignment) – A sequence, iterable over sequences, or another Alignment.

Returns:

A new Alignment object with added sequences.

Return type:

t.Self

align(seq)[source]

Align (add) sequences to this alignment via add_method.

>>> a = Alignment([('A', 'ABCD'), ('X', 'XXXX')])
>>> aa = a.align(('Y', 'ABXD'))
>>> aa.shape
(1, 4)
>>> aa.seqs
[('Y', 'ABXD')]
Parameters:

seq (abc.Iterable[_ST] | _ST | Alignment) – A sequence, iterable over sequences, or another Alignment.

Returns:

A new alignment object with sequences from _seq. The original number of columns should be preserved, which is true when using the default add_method.

Return type:

t.Self

annotate(objs, map_name, accept_fn=None, **kwargs)[source]

This function “annotates” sequence segments using MSA.

Namely, it adds each sequence of the provided chain-type objects to sequences currently present in this MSA via add_method. The latter is expected to preserve the original number of MSA columns, whereas potentially cutting the original sequence, thereby defining MSA-imposed boundaries. These are used to extract a child object using spawn_child method, which will have the corresponding MSA numbering written under map_name.

Parameters:
  • objs (abc.Iterable[_CT]) – An iterable over chain-type objects.

  • map_name (str) – A name to use for storing the derived MSA numbering map.

  • accept_fn (abc.Callable[[_CT], bool] | None) – A function accepting a chain-type object and returning a boolean value indicating whether the spawn child sequence should be preserved.

  • kwargs – Additional keyword arguments passed to the spawn_child() method.

Returns:

An iterator over spawned child objects. These are automatically stored under the children attribute of each chain-type object, in which case it’s safe to simply consume the returned iterator.

filter(fn)[source]

Filter alignment sequences.

Parameters:

fn (SeqFilter) – A function accepting a sequence – (name, _seq) pair – and returning a boolean.

Returns:

A new Alignment object with filtered sequences.

Return type:

t.Self

filter_gaps(max_frac=1.0, dim=0)[source]

Filter sequences or alignment columns having >= max_frac of gaps.

>>> a = Alignment([('A', 'AB---'), ('X', 'XXXX-'), ('Y', 'YYYY-')])

By default, the max_frac gaps is 1.0, which would remove solely gap-only sequences.

>>> aa = a.filter_gaps(dim=0)
>>> aa == a
True

Specifying max_frac removes sequences with over 50% gaps.

>>> aa = a.filter_gaps(dim=0, max_frac=0.5)
>>> 'A' not in aa
True

The last column is removed.

>>> a.filter_gaps(dim=1).shape
(3, 4)
Parameters:
  • max_frac (float) – a maximum fraction of allowed gaps in a sequence or a column.

  • dim (int) – 0 for sequences, 1 for columns.

Returns:

A new Alignment object with filtered sequences or columns.

Return type:

t.Self

itercols(*, join=True)[source]

Iterate over the Alignment columns.

>>> a = Alignment([('A', 'ABCD'), ('X', 'XXXX')])
>>> list(a.itercols())
['AX', 'BX', 'CX', 'DX']
Parameters:

join (bool) – Join columns into a string.

Returns:

An iterator over columns.

Return type:

Iterator[str] | Iterator[list[str]]

classmethod make(seqs, method=<function mafft_align>, add_method=<function mafft_add>, align_method=<function mafft_align>)[source]

Create a new alignment from a collection of unaligned sequences. For aligned sequences, please utilize read().

Parameters:
  • seqs (Iterable[tuple[str, str]]) – An iterable over (header, _seq) objects.

  • method (AlignMethod) – A callable accepting unaligned sequences and returning the aligned ones.

  • add_method (AddMethod) – A sequence addition method for a new Alignment object.

  • align_method (AlignMethod) – An alignment method for a new Alignment object.

Returns:

An alignment created from aligned seqs.

Return type:

Alignment

map(fn)[source]

Map a function to sequences.

>>> a = Alignment([('A', 'AB---')])
>>> a.map(lambda x: (x[0].lower(), x[1].replace('-', '*'))).seqs
[('a', 'AB***')]
Parameters:

fn (SeqMapper) – A callable accepting and returning a sequence.

Returns:

A new Alignment object.

Return type:

t.Self

classmethod read(inp, read_method=<function read_fasta>, add_method=<function mafft_add>, align_method=<function mafft_align>)[source]

Read sequences and create an alignment.

Parameters:
  • inp (Path | TextIOBase | abc.Iterable[str]) – A Path to aligned sequences, or a file handle, or iterable over file lines.

  • read_method (SeqReader) – A method accepting inp and returning an iterable over pairs (header, _seq). By default, it’s read_fasta(). Hence, the default expected format is fasta.

  • add_method (AddMethod) – A sequence addition method for a new Alignment object.

  • align_method (AlignMethod) – An alignment method for a new Alignment object.

Returns:

An alignment with sequences read parsed from the provided input.

Return type:

t.Self

classmethod read_make(inp, read_method=<function read_fasta>, add_method=<function mafft_add>, align_method=<function mafft_align>)[source]

A shortcut combining read() and make().

It parses sequences from inp, aligns them and creates

the Alignment object.

Parameters:
  • inp (Path | TextIOBase | abc.Iterable[str]) – A Path to aligned sequences, or a file handle, or iterable over file lines.

  • read_method (SeqReader) – A method accepting inp and returning an iterable over pairs (header, _seq). By default, it’s read_fasta(). Hence, the default expected format is fasta.

  • add_method (AddMethod) – A sequence addition method for a new Alignment object.

  • align_method (AlignMethod) – An alignment method for a new Alignment object.

Returns:

An alignment from parsed and aligned inp sequences.

Return type:

t.Self

realign()[source]

Realign sequences in seqs using align_method.

Returns:

A new Alignment object with realigned sequences.

remove(item, error_if_missing=True, realign=False)[source]

Remove a sequence or collection of sequences.

>>> a = Alignment([('A', 'ABCD-'), ('X', 'XXXX-'), ('Y', 'YYYYY')])
>>> aa = a.remove('A')
>>> 'A' in aa
False
>>> aa = a.remove(('Y', 'YYYYY'))
>>> aa.shape
(2, 5)
>>> aa = a.remove(('Y', 'YYYYY'), realign=True)
>>> aa.shape
(2, 4)
>>> aa['A']
'ABCD'
>>> aa = a.remove(['X', 'Y'])
>>> aa.shape
(1, 5)
Parameters:
  • item (str | _ST | t.Iterable[str] | t.Iterable[_ST]) –

    One of the following:

    • A str: a sequence’s name.

    • A pair (str, str) – a name with the sequence itself.

    • An iterable over sequence enames or pairs (not mixed!)

  • error_if_missing (bool) – Raise an error if any of the items are missing.

  • realign (bool) – Realign seqs after removal.

Returns:

A new Alignment object with the remaining sequences.

Return type:

t.Self

slice(start, stop, step=None)[source]

Slice alignment columns.

>>> a = Alignment([('A', 'ABCD'), ('X', 'XXXX')])
>>> aa = a.slice(1, 2)
>>> aa.shape == (2, 2)
True
>>>
>>> aa.seqs[0]
('A', 'AB')
>>> aa = a.slice(-4, 10)
>>> aa.seqs[0]
('A', 'ABCD')

To add the aligned sequences to the existing ones, use + or add():

>>> aaa = a + aa
>>> aaa.shape
(3, 4)
Parameters:
  • start (int) – Start coordinate, boundaries inclusive.

  • stop (int) – Stop coordinate, boundaries inclusive.

  • step (int | None) – Step for slicing, i.e., take every column separated by step - 1 number of columns.

Returns:

A new alignment with sequences subset according to the slicing params.

Return type:

t.Self

write(out, write_method=<function write_fasta>)[source]

Write an alignment.

Parameters:
  • out (Path | SupportsWrite) – Any object with the write method.

  • write_method (SeqWriter) – The writing function itself, accepting sequences and out. By default, use read_fasta to write in fasta format.

Returns:

Nothing.

Return type:

None

add_method: AddMethod
align_method: AlignMethod
seqs: list[tuple[str, str]]
property shape: tuple[int, int]
Returns:

(# sequences, # columns)

lXtractor.core.base module

Base classes, commong types and functions for the core module.

class lXtractor.core.base.AbstractResource(resource_path, resource_name)[source]

Bases: object

Abstract base class defining basic interface any resource must provide.

__init__(resource_path, resource_name)[source]
Parameters:
  • resource_path (str | Path) – Path to parsed resource data.

  • resource_name (str | None) – Resource’s name.

abstract dump(path)[source]

Save the resource under the given path.

abstract fetch(url)[source]

Download the resource.

abstract parse()[source]

Parse the read resource, so it’s ready for usage.

abstract read()[source]

Read the resource using the resource_path

class lXtractor.core.base.AddMethod(*args, **kwargs)[source]

Bases: Protocol

A callable to add sequences to the aligned ones, preserving the alignment length.

__call__(msa, seqs)[source]

Call self as a function.

Return type:

Iterable[tuple[str, str]]

__init__(*args, **kwargs)
class lXtractor.core.base.AlignMethod(*args, **kwargs)[source]

Bases: Protocol

A callable to align arbitrary sequences.

__call__(seqs)[source]

Call self as a function.

Return type:

Iterable[tuple[str, str]]

__init__(*args, **kwargs)
class lXtractor.core.base.ApplyT(*args, **kwargs)[source]

Bases: Protocol[T]

__call__(x)[source]

Call self as a function.

Return type:

T

__init__(*args, **kwargs)
class lXtractor.core.base.ApplyTWithArgs(*args, **kwargs)[source]

Bases: Protocol[T]

__call__(x, *args, **kwargs)[source]

Call self as a function.

Return type:

T

__init__(*args, **kwargs)
class lXtractor.core.base.FilterT(*args, **kwargs)[source]

Bases: Protocol[T]

__call__(x)[source]

Call self as a function.

Return type:

bool

__init__(*args, **kwargs)
class lXtractor.core.base.NamedTupleT(*args, **kwargs)[source]

Bases: Protocol, Iterable

__init__(*args, **kwargs)
class lXtractor.core.base.Ord(*args, **kwargs)[source]

Bases: Protocol[_T]

Any objects defining comparison operators.

__init__(*args, **kwargs)
class lXtractor.core.base.ResNameDict[source]

Bases: UserDict

A dictionary providing mapping between PDB residue names and their one-letter codes. The mapping was parsed from the CCD and can be obtained by calling lXtractor.ext.ccd.CCD.make_res_name_map().

>>> d = ResNameDict()
>>> assert d['ALA'] == 'A'
__init__()[source]
class lXtractor.core.base.SeqFilter(*args, **kwargs)[source]

Bases: Protocol

A callable accepting a pair (header, _seq) and returning a boolean.

__call__(seq, **kwargs)[source]

Call self as a function.

Return type:

bool

__init__(*args, **kwargs)
class lXtractor.core.base.SeqMapper(*args, **kwargs)[source]

Bases: Protocol

A callable accepting and returning a pair (header, _seq).

__call__(seq, **kwargs)[source]

Call self as a function.

Return type:

tuple[str, str]

__init__(*args, **kwargs)
class lXtractor.core.base.SeqReader(*args, **kwargs)[source]

Bases: Protocol

A callable reading sequences into tuples of (header, _seq) pairs.

__call__(inp)[source]

Call self as a function.

Return type:

Iterable[tuple[str, str]]

__init__(*args, **kwargs)
class lXtractor.core.base.SeqWriter(*args, **kwargs)[source]

Bases: Protocol

A callable writing (header, _seq) pairs to disk.

__call__(inp, out)[source]

Call self as a function.

__init__(*args, **kwargs)
class lXtractor.core.base.SupportsWrite(*args, **kwargs)[source]

Bases: Protocol

Any object with the write method.

__init__(*args, **kwargs)
write(data)[source]

Write the supplied data.

class lXtractor.core.base.UrlGetter(*args, **kwargs)[source]

Bases: Protocol

A callable accepting some string arguments and turning them into a valid url.

__call__(*args)[source]

Call self as a function.

Return type:

str

__init__(*args, **kwargs)

lXtractor.core.config module

A module encompassing various settings of lXtractor objects.

class lXtractor.core.config.AtomMark(value)[source]

Bases: IntFlag

The atom categories. Some categories may be combined, e.g., LIGAND | PEP is another valid category denoting ligand peptide atoms.

CARB: int = 32

Carbohydrate polymer atoms.

COVALENT: int = 64

Covalent polymer modifications including ligands.

LIGAND: int = 4

Ligand atom. If not combined with PEP, NUC, or CARB, this category denotes non-polymer (small molecule) single-residue ligands.

NUC: int = 16

Nucleotide polymer atoms.

PEP: int = 8

Peptide polymer atoms.

SOLVENT: int = 2

Solvent atom.

UNK: int = 1

Unknown atom.

class lXtractor.core.config.Config(default_config_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/lxtractor/checkouts/latest/lXtractor/resources/default_config.json'), user_config_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/lxtractor/checkouts/latest/lXtractor/resources/user_config.json'))[source]

Bases: UserDict

A configuration management class.

This class facilitates the loading and saving of configuration settings, with a user-specified configuration overriding the default settings.

Parameters:
  • default_config_path (str | Path) – The path to the default config file. This is a reference default settings, which can be used to reset user settings if needed.

  • user_config_path (str | Path) – The path to the user configuration file. This file is stored internally and can be modified by a user to provide permanent settings.

Loading and mofifying the config:

>>> cfg = Config()
>>> list(cfg.keys())[:2]
['bonds', 'colnames']
>>> cfg['bonds']['non_covalent_upper']
5.0
>>> cfg['bonds']['non_covalent_upper'] = 6

Equivalently, one can update the config by a local JSON file or dict:

>>> cfg.update_with({'bonds': {'non_covalent_upper': 4}})
>>> assert cfg['bonds']['non_covalent_upper'] == 4

The changes can be stored internally and loaded automatically in the future:

>>> cfg.save()
>>> cfg = Config()
>>> assert cfg['bonds']['non_covalent_upper'] == 4

To restore default settings:

>>> cfg.reset_to_defaults()
>>> cfg.clear_user_config()
__init__(default_config_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/lxtractor/checkouts/latest/lXtractor/resources/default_config.json'), user_config_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/lxtractor/checkouts/latest/lXtractor/resources/user_config.json'))[source]
clear_user_config()[source]

Clear the contents of the locally stored user config file.

reload()[source]

Load the configuration from files.

reset_to_defaults()[source]

Reset the configuration to the default settings.

save(user_config_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/lxtractor/checkouts/latest/lXtractor/resources/user_config.json'))[source]

Save the current configuration. By default, will store the configuration internally. This stored configuration will be loaded automatically on top of the default configuration.

Parameters:

user_config_path (str | Path) – The path where to save the user configuration file.

Raises:

ValueError – If the user config path is not provided.

temporary_namespace()[source]

A context manager for a temporary config namespace.

Within this context, changes to the config are allowed, but will be reverted back to the original config once the context is exited.

Example:

>>> cfg = Config()
>>> with cfg.temporary_namespace():
...     cfg['bonds']['non_covalent_upper'] = 10
...     # Do some stuff with the temporary config...
... # Config is reverted back to original state here
>>> assert cfg['bonds']['non_covalent_upper'] != 10
update_with(other)[source]
lXtractor.core.config.serialize_json_value(obj)[source]

Recursively convert objects to a JSON-serializable form.

lXtractor.core.exceptions module

exception lXtractor.core.exceptions.AmbiguousData[source]

Bases: ValueError

exception lXtractor.core.exceptions.AmbiguousMapping[source]

Bases: ValueError

exception lXtractor.core.exceptions.ConfigError[source]

Bases: ValueError

Some configuration problem.

exception lXtractor.core.exceptions.FailedCalculation[source]

Bases: RuntimeError

exception lXtractor.core.exceptions.FormatError[source]

Bases: ValueError

exception lXtractor.core.exceptions.InitError[source]

Bases: ValueError

A broad category exception for problems with an object’s initialization

exception lXtractor.core.exceptions.LengthMismatch[source]

Bases: ValueError

exception lXtractor.core.exceptions.MissingData[source]

Bases: ValueError

exception lXtractor.core.exceptions.NoOverlap[source]

Bases: ValueError

exception lXtractor.core.exceptions.OverlapError[source]

Bases: ValueError

exception lXtractor.core.exceptions.ParsingError[source]

Bases: ValueError

lXtractor.core.ligand module

class lXtractor.core.ligand.Ligand(parent, mask, contact_mask, ligand_idx, dist, meta=None)[source]

Bases: object

Ligand object is a part of the structure falling under certain criteria.

Namely, a ligand is a non-polymer and non-solvent molecule or a single monomer. Such ligands will be designated using the format:

{res_name}_{res_id}:{chain_id}<-({parent})

If a ligand contains multiple monomers, by convention, this is a polymer ligand. Such ligands should be named using the first letter of the polymer type; one of the ("p", "n", "c"). In this case, it’s ID will be of the following format:

{polymer_type}_{min_res_id}-{max_res_id}:{chain_id}<-({parent})

This information is provided by meta and shouldn’t be changed. However, any additional fields can be stored in meta which will be retrieved when constructing summary().

Attributes mask and contact_mask are boolean masks allowing to obtain ligand and ligand-contacting atoms from parent.

..seealso ::

make_ligand() to initialize a new ligand in an easy way.

__init__(parent, mask, contact_mask, ligand_idx, dist, meta=None)[source]
is_locally_connected(mask)[source]

Check whether this ligand is connected to a subset of parent atoms.

Parameters:

mask (ndarray) – A boolean mask to filter parent atoms.

Returns:

True if the ligand has at least min_atom_connections to parent substructure imposed by the provided mask.

Return type:

bool

summary(meta=True)[source]
Return type:

Series

property array: AtomArray
Returns:

An array of ligand atoms within parent.

property chain_id: str
Returns:

Ligand chain ID.

contact_mask: np.ndarray

A boolean mask such that when applied to the parent, subsets the latter to its ligand-contacting atoms.

dist

An array of distances for each ligand-contacting parent’s atom.

property id: str
is_polymer
ligand_idx

An integer array with indices pointing to ligand atoms contacting the parent structure.

mask

A boolean mask such that when applied to the parent, subsets the latter to the ligand residues.

meta

A dictionary of meta info.

parent: GenericStructure

Parent structure.

property parent_contact_atoms: AtomArray
Returns:

An array of ligand-contacting atoms within parent.

property parent_contact_chains: set[str]
Returns:

A set of chain IDs involved in forming contacts with ligand.

property res_id: str
Returns:

Ligand residue number.

property res_name: str
Returns:

Ligand residue name.

lXtractor.core.ligand.ligands_from_atom_marks(structure)[source]
Return type:

abc.Generator[Ligand, None, None]

lXtractor.core.ligand.make_ligand(m_lig, m_pol, structure)[source]

Create a new Ligand object. The criteria to qualify for a ligand are defined by the global config (DefaultConfig["ligand"]).

Whether a ligand molecule is created is subject to several checks:

#. It has a certain number of atoms.
#. It has a certain number of contacts with the polymer.
#. It contacts a certain number of residues in the polymer.
#. Its atoms span a single chain.

If a ligand doesn’t pass any of these checks, the function returns None.

Parameters:
  • m_lig (npt.NDArray[np.bool_]) – A boolean mask pointing to putative ligand atoms.

  • m_pol (npt.NDArray[np.bool_]) – A boolean mask pointing to polymer atoms that supposedly contact ligand atoms.

  • structure (GenericStructure) – A parent structure to which the masks can be applied.

Returns:

An instantiated ligand or None if the checks were not passed.

Return type:

Ligand | None

lXtractor.core.pocket module

The module defines Pocket, representing an arbitrarily defined binding pocket.

class lXtractor.core.pocket.Pocket(definition, name='Pocket')[source]

Bases: object

A binding pocket.

The pocket is defined via a single string following a particular syntax (a definition), such that, when applied to a ligand using is_connected(), the latter outputs True if ligand is connected. Consequently, it is tightly bound to lXtractor.core.ligand.Ligand. Namely, the definition relies on two matrices:

  1. “c” = lXtractor.core.ligand.Ligand.contact_mask (boolean mask)

  2. “d” = lXtractor.core.ligand.Ligand.dist (distances)

The definition is a combination of statements. Each statement involves the selection consisting of a matrix (“c” or “d”), residue positions, and residue atom names, formatted as:

{matrix-prefix}:[pos]:[atom_names] {sign} {number}

where [pos] and [atom_names] can be comma-separated lists, sign is` a comparison operator, and a number (int or float) is what to compare to. For instance, selection c:1:CA,CB == 2 translates into “must have exactly two contacts with atoms “CA” and “CB” at position 1. See more examples below.

Comparison meaning depends on the matrix type used: “c” or “d”.

In the former case, >= x means “at least x contacts”. In the latter case, “<= x” means “have distance below x”.

In the case of the “d” matrix, applying selection and comparison will result in a vector of bool bool values, requiring an aggregation. Two aggregation types are supported: “da” (any) and “daa” (all).

In the case of the “c” matrix, possible matrix prefixes are “c” and “cs”. They have very different meanings! In the former case, the statements compares the total number of contacts when the selection is applied. In the latter case, the statement will select residues separately and, for each residue, decide whether the selected atoms form enough contact to extrapolate towards the full residue and mark it as “contacting” (controlled via min_contacts). These decisions are summed across each residue and this sum is compared to the number in the statement. See the example below.

Finally, statements can be bracketed and combined by boolean operators “AND” and “OR” (which one can abbreviate by “&” and “|”).

Examples:

At least two contacts with any atom of residues 1 and 5:

c:1,5:any >= 2

Note that the above is a “cumulative” statement, i.e., it is applied to both residues at the same time. Thus, if a residue 1 has two atoms contacting a ligand while a residue 2 has none, this will still evaluate to True. The following definition will ensure that each residue has at least two contacts:

c:1:any >= 2 & c:2:any >= 2

In contrast, the following statement will translate “among residues 1, 2, and 3, there are at least two “contacting” residues:

cs:1,2,3:any >= 2

Any atoms farther than 10A from alpha-carbons of positions 1 and 10:

da:1,10:CA > 10

Any atoms with at least two contacts with any atoms at position 1 or all CA atoms closer than 6A of positions 2 and 3:

c:1:any >= 2 | daa:2,3:CA < 6

CA or CB atoms with a contact at position 1 but not 2, while position 3 has any atoms below 10A threshold:

c:1:CA,CB >= 1 & c:2:CA,CB == 0 & da:3:any <= 10

Contact with positions 1 and 2 or positions 3 and 4:

(c:1:any >= 1 & c:2:any >= 1) | (c:3:any >= 1 & c:4:any >= 1)
__init__(definition, name='Pocket')[source]
is_connected(ligand, mapping=None, **kwargs)[source]

Check whether a ligand is connected.

Parameters:
  • ligand (Ligand) – An arbitrary ligand.

  • mapping (dict[int, int] | None) – A mapping to the ligand’s parent structure numbering.

  • kwargs – Passed to translate_definition().

Returns:

True if the ligand is bound within the pocket and False otherwise.

Return type:

bool

definition
name
lXtractor.core.pocket.make_sel(pos, atoms)[source]

Make a selection string from positions and atoms.

>>> make_sel(1, 'any')
'(a.res_id == 1)'
>>> make_sel([1, 2], 'CA,CB')
"np.isin(a.res_id, [1, 2]) & np.isin(a.atom_name, ['CA', 'CB'])"
Parameters:
  • pos (int | Sequence[int]) –

  • atoms (str) –

Returns:

Return type:

str

lXtractor.core.pocket.translate_definition(definition, mapping=None, *, skip_unmapped=False, min_contacts=1)[source]

Translates the Pocket.definition into a series of statements, such that, when applied to ligand matrices, evaluate to bool.

>>> translate_definition("c:1:any > 1")
'(c[np.isin(a.res_id, [1])].sum() > 1)'
>>> translate_definition("da:1,2:CA,CZ <= 6")
"(d[np.isin(a.res_id, [1, 2]) & np.isin(a.atom_name, ['CA', 'CZ'])] <= 6).any()"
>>> translate_definition("daa:1,2:any > 2", {1: 10}, skip_unmapped=True)
'(d[np.isin(a.res_id, [10])] > 2).all()'
>>> translate_definition("cs:1,2:any > 2")
'sum([c[(a.res_id == 1)].sum() >= 1, c[(a.res_id == 2)].sum() >= 1]) > 2'

Warning

skip_unmapped=True may change the pocket’s definition and lead to undesired conclusions. Caution advised!

Parameters:
  • definition (str) – A string definition of a Pocket.

  • mapping (dict[int, int] | None) – An optional mapping from the definition’s numbering to a structure’s numbering.

  • skip_unmapped (bool) – If the mapping is provided and some position is left unmapped, skip this position.

  • min_contacts (int) – If prefix is “cs”, use this threshold to determine a minimum number of residue contacts required to consider it bound.

Returns:

A new string with statements of the provided definition translated into a numpy syntax.

Return type:

str

lXtractor.core.segment module

Module defines a segment object serving as base class for sequences in lXtractor.

class lXtractor.core.segment.Segment(start, end, name='S', seqs=None, parent=None, children=None, meta=None, variables=None)[source]

Bases: Sequence[NamedTupleT]

An arbitrary segment with inclusive boundaries containing arbitrary number of sequences.

Sequences themselves may be retrieved via [] syntax:

>>> s = Segment(1, 10, 'S', seqs={'X': list(range(10))})
>>> s.id == 'S|1-10'
True
>>> s['X'] == list(range(10))
True
>>> 'X' in s
True

One can use the same syntax to check if a Segment contains certain index:

>>> 1 in s and 10 in s and not 11 in s
True

Iteration over the segment yields it’s items:

>>> next(iter(s))
Item(i=1, X=0)

One can just get the same item by explicit index:

>>> s[1]
Item(i=1, X=0)

Slicing returns an iterable slice object:

>>> list(s[1:2])
[Item(i=1, X=0), Item(i=2, X=1)]

One can add a new sequence in two ways.

  1. using a method:

>>> s.add_seq('Y', tuple(range(10, 20)))
>>> 'Y' in s
True
  1. using [] syntax:

>>> s['Y'] = tuple(range(10, 20))
>>> 'Y' in s
True

Note that using the first method, if s already contains Y, this will cause an exception. To overwrite a sequence with the same name, please use explicit [] syntax.

Additionally, one can offset Segment indices using >>/<< syntax. This operation mutates original Segment!

>>> s >> 1
S|2-11
>>> 11 in s
True
__init__(start, end, name='S', seqs=None, parent=None, children=None, meta=None, variables=None)[source]
Parameters:
  • start (int) – Start coordinate.

  • end (int) – End coordinate.

  • name (str) – The name of the segment. Name with start and end coordinates should uniquely specify the segment. They are used to dynamically construct id().

  • seqs (dict[str, abc.Sequence[t.Any]] | None) – A dictionary name => sequence, where sequence is some sequence (preferably mutable) bounded by segment. Name of a sequence must be “simple”, i.e., convertable to a field of a namedtuple.

  • parent (t.Self | None) – Parental segment bounding this instance, typically obtained via sub() or sub_by() methods.

  • children (abc.MutableSequence[t.Self] | None) – A mapping name => Segment with child segments bounded by this instance.

  • meta (dict[str, t.Any] | None) – A dictionary with any meta-information str() => str() since reading/writing meta to disc will inevitably convert values to strings.

  • variables (Variables | None) – A collection of variables calculated or staged for calculation for this segment.

add_seq(name, seq)[source]

Add sequence to this segment.

Parameters:
  • name (str) – Sequence’s name. Should be convertible to the namedtuple’s field.

  • seq (Sequence[Any]) – A sequence with arbitrary elements and the length of a segment.

Returns:

returns nothing. This operation mutates attr:`seqs.

Raises:

ValueError – If the name is reserved by another segment.

Return type:

None

append(other, filler=<function Segment.<lambda>>, joiner=<built-in function add>)[source]

Append another segment to this one.

The encompassed sequences will be merged together by joiner. If a sequence is missing in this segment or other, filler will create a sequence with filled values. The sequences will be deep-copied before merge.

>>> a = Segment(1, 3, "A", seqs={"A": "AAA"})
>>> b = Segment(1, 2, "B", seqs={"B": "BB"})
>>> c = a.append(b, filler=lambda x: '*' * x)
>>> c.id
'A|1-5'
>>> c['A']
'AAA**'
>>> c['B']
'***BB'

Note that the same can be achieved via | operator:

>>> a | b == a.append(b, filler=lambda x: '*' * x)
True

This will use "*" filler for str-type sequences and None for the rest and use the default joiner for joining them.

Note

Appending to an empty segment will return other. Appending an empty segment will return this segment.

Warning

Appending creates a new segment and removes associated parent and metadata

Parameters:
  • other (t.Self) – Another arbitrary segment.

  • filler (_Filler | abc.Mapping[str, _Filler]) – A callable accepting the positive integer and returning a filled in a sequence or a dict mapping sequence names to such callables.

  • joiner (_Joiner | abc.Mapping[str, _Joiner]) – A callable accepting two sequences and returning a merged sequence or a dict mapping sequence names to such callables.

Returns:

A new segment with the same name as this segment, extended by other.

Return type:

t.Self

bounded_by(other)[source]

Check whether this segment is bounded by other.

self:   +----+
other: +------+
=> True

:param other; Another segment.

Return type:

bool

bounds(other)[source]

Check if this segment bounds other.

self: +-------+
other:  +----+
=> True

:param other; Another segment.

Return type:

bool

id_strip_parents()[source]
Returns:

An identifier of this segment without parent information.

insert(other, i, **kwargs)[source]

Insert a segment into this one.

The function splits this segment into two parts at the provided index and insert other between them via append(). The latter handles common/unique sequences via filler and joiner arguments, which can be passed here as keyword arguments.

Note

Inserting an empty segment returns this instance. Inserting a segment at the end() appends other.

Warning

Inserting creates a new segment and removes associated parent and metadata

Parameters:
  • other (t.Self) – Another segment to insert.

  • i (int) – Index to insert at. The insertion will be performed after i.

  • kwargs – Passed to append().

Returns:

A new segment with inserted other.

Raises:

IndexError – If attempting to insert at an invalid index. Only indices start < i <= end are valid.

Return type:

t.Self

overlap(start, end)[source]

Create new segment from the current instance using overlapping boundaries.

Parameters:
  • start (int) – Starting coordinate.

  • end (int) – Ending coordinate.

Returns:

New overlapping segment with data and name

Return type:

t.Self

overlap_with(other, deep_copy=True, handle_mode='merge', sep='&')[source]

Overlap this segment with other over common indices.

self: +---------+
other:    +-------+
=>:       +-----+
Parameters:
  • other (Segment) – other Segment instance.

  • deep_copy (bool) – deepcopy seqs to avoid side effects.

  • handle_mode (str) –

    When the child overlapping segment is created, this parameter defines how name and meta are handled. The following values are possible:

    • ”merge”: merge meta and name from self and other

    • ”self”: the current instance provides both attributes

    • ”other”: other provides both attributes

  • sep (str) – If handle_mode == “merge”, the new name is created by joining names of self and other using this separator.

Returns:

New segment instance with inherited name and meta.

Return type:

t.Self

overlaps(other)[source]

Check whether a segment overlaps with the other segment. Use overlap_with() to produce an overlapping child Segment.

Parameters:

other (Segment) – other Segment instance.

Returns:

True if segments overlap and False otherwise.

Return type:

bool

remove_seq(name)[source]

Remove sequence from this segment.

Parameters:

name (str) – Sequence’s name. If doesn’t exist in this segment, nothing happens.

sub(start, end, **kwargs)[source]

Subset current segment using provided boundaries. Will create a new segment and call sub_by().

Parameters:
  • start (int) – new start.

  • end (int) – new end.

  • kwargs – passed to overlap_with()

Return type:

t.Self

sub_by(other, **kwargs)[source]

A specialized version of overlap_with() used in cases where other is assumed to be a part of the current segment (hence, a subsegment).

Parameters:
  • other (Segment) – Some other segment contained within the (start, end) boundaries.

  • kwargs – Passed to overlap_with().

Returns:

A new Segment object with boundaries of other. See overlap_with() on how to handle segments’ names and data.

Raises:

NoOverlap – If other’s boundaries lie outside the existing start, end.

Return type:

t.Self

children
property end: int
Returns:

A Segment’s end coordinate.

property id: str
Returns:

Unique segment’s identifier encapsulating name, boundaries and parents of a segment if it was spawned from another Segment instance. For example:

S|1-2<-(P|1-10)

would specify a segment S with boundaries [1, 2] descended from P.

property is_empty: bool
Returns:

True if the segment is empty. Emptiness is a special case, in which Segment has start == end == 0.

property is_singleton: bool
Returns:

True if the segment contains a single element. In this special case, start == end.

property item_type: _Item

A factory to make an Item namedtuple object encapsulating sequence names contained within this instance. The first field is reserved for “i” – an index. :return: Item namedtuple object.

meta: dict[str, t.Any]
property name: str
property parent: t.Self | None
property seq_names: list[str]
Returns:

A list of sequence names this segment entails.

property start: int
Returns:

A Segment’s start coordinate.

variables: Variables
lXtractor.core.segment.do_overlap(segments)[source]

Check if any pair of segments overlap.

Parameters:

segments (Iterable[Segment]) – an iterable with at least two segments.

Returns:

True if there are overlapping segments, False otherwise.

Return type:

bool

lXtractor.core.segment.map_segment_numbering(segments_from, segments_to)[source]

Create a continuous mapping between the numberings of two segment collections. They must contain the same number of equal length non-overlapping segments. Segments in the segments_from collection are considered to span a continuous sequence, possibly interrupted due to discontinuities in a sequence represented by segments_to’s segments. Hence, the segments in segments_from form continuous numbering over which numberings of segments_to segments are joined.

Parameters:
  • segments_from (Sequence[Segment]) – A sequence of segments to map from.

  • segments_to (Sequence[Segment]) – A sequence of segments to map to.

Returns:

An iterable over (key, value) pairs. Keys correspond to numberings of the segments_from, values – to numberings of segments_to.

Return type:

Iterator[tuple[int, int | None]]

lXtractor.core.segment.resolve_overlaps(segments, value_fn=<built-in function len>, max_it=None, verbose=False)[source]

Eliminate overlapping segments.

Convert segments into and undirected graph (see segments2graph()). Iterate over connected components. If a component has only a single node (no overlaps§), yield it. Otherwise, consider all possible non-overlapping subsets of nodes. Find a subset such that the sum of the value_fn over the segments is maximized and yield nodes from it.

Parameters:
  • segments (Iterable[Segment]) – A collection of possibly overlapping segments.

  • value_fn (Callable[[Segment], float]) – A function accepting the segment and returning its value.

  • max_it (int | None) – The maximum number of subsets to consider when resolving a group of overlapping segments.

  • verbose (bool) – Progress bar and general info.

Returns:

A collection of non-overlapping segments with maximum cumulative value. Note that the optimal solution is guaranteed iff the number of possible subsets for an overlapping group does not exceed max_it.

Return type:

Generator[Segment, None, None]

lXtractor.core.segment.segments2graph(segments)[source]

Convert segments to an undirected graph such that segments are nodes and edges are drawn between overlapping segments.

Parameters:

segments (Iterable[Segment]) – an iterable with segments objects.

Returns:

an undirected graph.

Return type:

Graph

lXtractor.core.structure module

Module defines basic interfaces to interact with macromolecular structures.

class lXtractor.core.structure.CarbohydrateStructure(array, structure_id, ligands=True, atom_marks=None, graph=None)[source]

Bases: GenericStructure

A structure type where primary polymer is carbohydrate.

See also

GenericStructure for general-purpose documentation.

__init__(array, structure_id, ligands=True, atom_marks=None, graph=None)[source]
Parameters:
  • array (AtomArray) – Atom array object.

  • name – ID of a structure in array.

  • ligands (bool | list[Ligand]) – A list of ligands or flag indicating to extract ligands during initialization.

class lXtractor.core.structure.GenericStructure(array, name, ligands=None, atom_marks=None, graph=None)[source]

Bases: object

A generic macromolecular structure with possibly many chains holding a single biotite.structure.AtomArray instance.

This object is a core data structure in lXtractor for structural data.

The object is considered immutable: atoms of a structure can’t change their location or properties, as well as other protected attributes.

While atoms are stored as biotite.structure.AtomArray, GenericStructure defines additional annotations for each atom and operations crucial for other objects such as lXtractor.core.chain.ChainStructure.

Upon initialization, atom array attains graph representation (graph()) using lXtractor.util.structure.to_graph() function. Using this representation, atom annotations are attained via :func``mark_atoms_g`. These annotations can be accessed via atom_marks(). For convenience, boolean masks are stored and can be applied to the array() as follows:

# Assume ``s`` is a :class:`GenericStructure` object.
s[s.mask.`mask_name`]

To view available mask names, see Masks.

One of the most crucial annotations is the so-called “primary_polymer”. These atoms serve as a frame of reference for all other atoms in a structure. The rest of the atoms are categorized as either ligand or solvent. Sometimes the annotation process fails to identify certain atoms. In such cases, a warning is logged. To view uncategorized atoms, one can use the following mask:

s[s.mask.unk]

Note

Using __getitem__(item) like in s[s.mask.unk will return an atom array. Use subset() to obtain a new generic structure or initialize a new ``GenericStructure(s[s.mask.unk] instance; it will be equivalent.

Methods __repr__ and __str__ output a string in the format: {_name}:{polymer_chain_ids};{ligand_chain_ids}|{altloc_ids} where *ids are “,”-separated.

__init__(array, name, ligands=None, atom_marks=None, graph=None)[source]
Parameters:
  • array (AtomArray) – Atom array object.

  • name (str) – ID of a structure in array.

  • ligands (Sequence[Ligand] | None) – A list of ligands or flag indicating to extract ligands during initialization.

extract_positions(pos, chain_ids=None, **kwargs)[source]

Extract specific positions from this structure.

Parameters:
  • pos (abc.Sequence[int]) – A sequence of positions (res_id) to extract.

  • chain_ids (abc.Sequence[str] | str | None) – Optionally, a single chain ID or a sequence of such.

  • kwargs – Passed to subset().

Returns:

A new instance with extracted residues.

Return type:

t.Self

extract_segment(start, end, chain_id, **kwargs)[source]

Create a sub-structure encompassing some continuous segment bounded by existing position boundaries.

Parameters:
  • start (int) – Residue number to start from (inclusive).

  • end (int) – Residue number to stop at (inclusive).

  • chain_id (str) – Chain to extract a segment from.

  • kwargs – Passed to subset().

Returns:

A new Generic structure with residues in [start, end].

Return type:

t.Self

get_sequence()[source]
Returns:

A generator over tuples, where each residue is described by: (1) one-letter code, (2) three-letter code, (3) residue number.

Return type:

Generator[tuple[str, str, int]]

classmethod make_empty(structure_id='XXXX')[source]
Parameters:

structure_id (str) – (Optional) ID of the created array.

Returns:

An instance with empty array().

Return type:

t.Self

classmethod read(inp, path2id=<function GenericStructure.<lambda>>, structure_id='XXXX', altloc=False, **kwargs)[source]

Parse the atom array from the provided input and wrap it into the GenericStructure object.

Note

If inp is not a Path, kwargs must contain the correct fmt (e.g., fmt=cif).

Parameters:
  • inp (IOBase | Path | str | bytes) – Path to a structure in supported format.

  • path2id (abc.Callable[[Path], str]) – A callable obtaining a PDB ID from the file path. By default, it’s a Path.stem.

  • structure_id (str) – A structure unique identifier (e.g., PDB ID). If not provided and the input is Path, will use path2id to infer the ID. Otherwise, will use a constant placeholder.

  • altloc (bool | str) – Parse alternative locations and populate array.altloc_id attribute.

  • kwargs – Passed to load_structure.

Returns:

Parsed structure.

Return type:

t.Self

rm_solvent(copy=False)[source]
Parameters:

copy (bool) – Copy the resulting substructure.

Returns:

A substructure with solvent molecules removed.

split_altloc(**kwargs)[source]

Split into substructures based on altloc IDs. Atoms missing altloc annotations are distributed into every substructure. Thus, even if a structure contains a single atom having altlocs (say, A and B), this method will produce two substructed identical except for this atom.

Note

If array() does not specify any altloc ID, the method yields self.

Parameters:

kwargs – Passed to subset().

Returns:

An iterator over objects of the same type initialized by atoms having altloc annotations.

Return type:

abc.Iterator[t.Self]

split_chains(polymer=False, **kwargs)[source]

Split into separate chains. Splitting is done using biotite.structure.get_chain_starts().

Note

Preserved ligands may have a different chain_id.

Note

If there is a single chain, this method will return self.

Parameters:
  • polymer (bool) – Use only primary polymer chains for splitting.

  • kwargs – Passed to subset().

Returns:

An iterable over chains found in array.

Return type:

abc.Iterator[t.Self]

subset(mask, ligands=True, reinit_ligands=False, copy=False)[source]

Create a sub-structure potentially preserving connected ligands().

Warning

If DefaultConfig["structure"]["primary_pol_type"] is set to auto, and mask points to a polymer that is shorter than some existing ligand polymer, this ligand polymer will become a primary polymer in the substructure.

Parameters:
  • mask (np.ndarray) – Boolean mask, True for atoms in array(), used to create a sub-structure.

  • ligands (bool) – Keep ligands that are connected to atoms specified by mask.

  • reinit_ligands (bool) – Reinitialize ligands upon creating a sub-structure, rather than filtering existing ligands connected to atoms specified by mask. Takes precedence over the ligands option. This option is used in split_altloc().

  • copy (bool) – Copy the atom array resulting from subsetting the original one.

Returns:

A new instance with atoms defined by mask and connected ligands.

Return type:

t.Self

superpose(other, res_id_self=None, res_id_other=None, atom_names_self=None, atom_names_other=None, mask_self=None, mask_other=None)[source]

Superpose other structure to this one. Arguments to this function all serve a single purpose: to correctly subset both structures so the resulting selections have the same number of atoms.

The subsetting achieved either by specifying residue numbers and atom names or by supplying a binary mask of the same length as the number of atoms in the structure.

Parameters:
  • other (GenericStructure | AtomArray) – Other GenericStructure or atom array.

  • res_id_self (Iterable[int] | None) – Residue numbers to select in this structure.

  • res_id_other (Iterable[int] | None) – Residue numbers to select in other structure.

  • atom_names_self (Iterable[Sequence[str]] | Sequence[str] | None) – Atom names to select in this structure given either per-residue or as a single sequence broadcasted to selected residues.

  • atom_names_other (Iterable[Sequence[str]] | Sequence[str] | None) – Same as self.

  • mask_self (ndarray | None) – Binary mask to select atoms. Takes precedence over other selection arguments.

  • mask_other (ndarray | None) – Same as self.

Returns:

A tuple of (1) an other structure superposed onto this one, (2) an RMSD of the superposition, and (3) a transformation that had been used with biotite.structure.superimpose_apply().

Return type:

tuple[GenericStructure, float, tuple[ndarray, ndarray, ndarray]]

write(path, atom_marks=True, graph=True)[source]

Save this structure to a file. The format is automatically determined from the given path.

Additional files are saved using the same filename alongside the structure file. The filename will resolve to “structure” in all the following cases and result in “structure.npy” and “structure.json” files saved to the same dir:

path="/path/to/structure.pdb"
path="/path/to/structure.mmtf.gz"
path="/path/to/structure.with.many.dots.pdb.gz"
Parameters:
  • path (PathLike | str) – A path or a path-like object compatible with open(). Must not point to an existing directory. Must provide the structure format as an extension.

  • atom_marks (bool) – Save an array of atom marks in the npy format.

  • graph (bool) – Save molecular connectivity graph in the json format.

Returns:

Path to the saved structure if writing was successful.

Return type:

Path

property altloc_ids: list[str]
Returns:

A sorted list of altloc IDs. If none found, will output [""].

property array: AtomArray
Returns:

Atom array object.

property atom_marks: ndarray[Any, dtype[int64]]
Returns:

An array of lXtractor.core.config.AtomMark marks, categorizing each atom in this structure.

property chain_ids: list[str]
Returns:

A list of chain IDs this structure encompasses.

property chain_ids_ligand: list[str]
Returns:

A set of ligand chain IDs.

property chain_ids_polymer: list[str]
Returns:

A list of polymer chain IDs.

property graph: PyGraph
Returns:

A structure’s graph representation.

property id: str
Returns:

An identifier of this structure. It’s composed once upon initialization and has the following format: {_name}:{polymer_chain_ids};{ligand_chain_ids}|{altloc_ids}. It should uniquely identify a structure, i.e., one should expect two structures with the same ID to be identical.

property is_empty: bool
Returns:

True if the array() is empty.

property is_empty_polymer: bool

Check if there are any polymer atoms.

Returns:

True if there are >=1 polymer atoms and False otherwise.

property is_singleton: bool
Returns:

True if the structure contains a single residue.

property ligands: tuple[Ligand, ...]
Returns:

A list of ligands.

property mask: Masks
property name: str
Returns:

A name of the structure.

class lXtractor.core.structure.Masks(primary_polymer: 'npt.NDArray[np.bool_]', primary_polymer_ptm: 'npt.NDArray[np.bool_]', primary_polymer_modified: 'npt.NDArray[np.bool_]', solvent: 'npt.NDArray[np.bool_]', ligand: 'npt.NDArray[np.bool_]', ligand_covalent: 'npt.NDArray[np.bool_]', ligand_poly: 'npt.NDArray[np.bool_]', ligand_nonpoly: 'npt.NDArray[np.bool_]', ligand_pep: 'npt.NDArray[np.bool_]', ligand_nuc: 'npt.NDArray[np.bool_]', ligand_carb: 'npt.NDArray[np.bool_]', unk: 'npt.NDArray[np.bool_]')[source]

Bases: object

__init__(primary_polymer, primary_polymer_ptm, primary_polymer_modified, solvent, ligand, ligand_covalent, ligand_poly, ligand_nonpoly, ligand_pep, ligand_nuc, ligand_carb, unk)
ligand: ndarray[Any, dtype[bool_]]
ligand_carb: ndarray[Any, dtype[bool_]]
ligand_covalent: ndarray[Any, dtype[bool_]]
ligand_nonpoly: ndarray[Any, dtype[bool_]]
ligand_nuc: ndarray[Any, dtype[bool_]]
ligand_pep: ndarray[Any, dtype[bool_]]
ligand_poly: ndarray[Any, dtype[bool_]]
primary_polymer: ndarray[Any, dtype[bool_]]
primary_polymer_modified: ndarray[Any, dtype[bool_]]
primary_polymer_ptm: ndarray[Any, dtype[bool_]]
solvent: ndarray[Any, dtype[bool_]]
unk: ndarray[Any, dtype[bool_]]
class lXtractor.core.structure.NucleotideStructure(array, structure_id, ligands=True, atom_marks=None, graph=None)[source]

Bases: GenericStructure

A structure type where primary polymer is nucleotide.

See also

GenericStructure for general-purpose documentation.

__init__(array, structure_id, ligands=True, atom_marks=None, graph=None)[source]
Parameters:
  • array (AtomArray) – Atom array object.

  • name – ID of a structure in array.

  • ligands (bool | list[Ligand]) – A list of ligands or flag indicating to extract ligands during initialization.

class lXtractor.core.structure.ProteinStructure(array, structure_id, ligands=True, atom_marks=None, graph=None)[source]

Bases: GenericStructure

A structure type where primary polymer is peptide.

See also

GenericStructure for general-purpose documentation.

__init__(array, structure_id, ligands=True, atom_marks=None, graph=None)[source]
Parameters:
  • array (AtomArray) – Atom array object.

  • name – ID of a structure in array.

  • ligands (bool | list[Ligand]) – A list of ligands or flag indicating to extract ligands during initialization.

lXtractor.core.structure.mark_atoms(structure)[source]

Mark each atom in structure according to lXtractor.core.config.AtomMark.

This function is used upon initializing GenericStructure and its subclasses, storing the output under GenericStructure.atom_marks.

Parameters:

structure (GenericStructure) – An arbitrary structure.

Returns:

An array of atom marks (equivalently, classes or types).

Return type:

tuple[ndarray[Any, dtype[int64]], list[Ligand]]

lXtractor.core.structure.mark_atoms_g(s, single_poly_chain=False)[source]

Mark structure atoms based on a molecular graph’s representation by of the lXtractor.core.config.AtomMark categories.

Atoms are classified into five categories:

#. primary polymer: corresponds to ``PEP``, ``NUC`` or ``CARB``
categories.
#. solvent: ``SOLVENT``.
#. non polymer ligand: ``LIGAND``.
#. polymer ligand: A combination of ``LIGAND`` with one of the primary
polymer types, eg. ``AtomMark.LIGAND | AtomMark.NUC``.
#. unknown: ``UNK`` for atoms that couldn't be categorized.

The classification process depends on groups of atoms forming covalent bonds with each other, or connected components in the molecular graph representation. Each such component is assessed separately and its atoms are classified as polymer, ligand, or solvent. If the primary polymer is set to “auto” in config (DefaultConfig["structure"]["primary_pol_type"]), the polymer with the largest number of monomers will be selected. The rest of the polymers will become polymer ligands: special kind of ligand that can have multiple residues. See lXtractore.core.ligand.Ligand for details.

Parameters:
Returns:

Return type:

(npt.NDArray[np.int_], str, list[Ligand])