lXtractor.util package

lXtractor.util.io module

Various utilities for IO.

lXtractor.util.io.fetch_chunks(it, fetcher, chunk_size=100, **kwargs)[source]

A wrapper for fetching multiple links with ThreadPoolExecutor.

Parameters:

it (Iterable[V]) – Iterable over some objects accepted by the fetcher, e.g., links.
fetcher (Callable[[list[V]], T]) – A callable accepting a chunk of objects from it, fetching and returning the result.
chunk_size (int) – Split iterable into this many chunks for the executor.
kwargs – Passed to fetch_iterable().

Returns:

A generator of (chunk, result) pairs.

Return type:

Generator[tuple[list[V], T | Future], None, None]
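The chunk-and-fetch pattern described above can be sketched with the standard library alone. This is not the lXtractor implementation: fetch_chunks_sketch, chunked, and fake_fetcher are hypothetical names, and the sketch assumes chunk_size is the maximum number of items per chunk.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(it, chunk_size):
    """Yield consecutive lists of at most `chunk_size` items from `it`."""
    iterator = iter(it)
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk


def fetch_chunks_sketch(it, fetcher, chunk_size=100, num_threads=None):
    """Apply `fetcher` to each chunk in a thread pool, yielding (chunk, result)."""
    chunks = list(chunked(it, chunk_size))
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Executor.map yields results in the order the chunks were submitted.
        for chunk, result in zip(chunks, pool.map(fetcher, chunks)):
            yield chunk, result


def fake_fetcher(links):
    # Hypothetical stand-in for a network call: returns the length of each "link".
    return [len(link) for link in links]


results = list(fetch_chunks_sketch(["a", "bb", "ccc"], fake_fetcher, chunk_size=2))
```

Each yielded pair keeps the input chunk next to its result, so a failed or partial chunk can be traced back to the inputs that produced it.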

lXtractor.util.io.fetch_iterable(it, fetcher, num_threads=None, verbose=False, blocking=True, allow_failure=True)[source]

Parameters:

it (Iterable[V]) – Iterable over some objects accepted by the fetcher, e.g., links.
fetcher (Callable[[V], T]) – A callable accepting a single object from it, fetching and returning the result.
num_threads (int | None) – The number of threads for ThreadPoolExecutor.
verbose (bool) – Enable progress bar and warnings/exceptions on fetching failures.
blocking (bool) – If True, will wait for each result. Otherwise, will return Future objects instead of fetched data.
allow_failure (bool) – If True, failure to fetch will raise a warning instead of an exception. Otherwise, the warning is logged, and the results won’t contain inputs that failed to fetch.

Returns:

A generator of tuples where the first object is the input and the second object is the fetched data (or a Future object if blocking is False).

Return type:

Generator[tuple[V, T], None, None] | Generator[tuple[V, Future[T]], None, None]
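The difference between blocking and non-blocking modes can be illustrated with a small standard-library sketch. It is an approximation under assumptions, not the library's code: fetch_iterable_sketch and flaky are hypothetical names, and failed inputs are skipped with a warning when allow_failure is True.

```python
import warnings
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_iterable_sketch(it, fetcher, num_threads=None, blocking=True,
                          allow_failure=True):
    """Yield (input, result) pairs, or (input, Future) pairs if not blocking."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = {pool.submit(fetcher, x): x for x in it}
        if not blocking:
            # Hand the futures to the caller without waiting for results.
            for future, x in futures.items():
                yield x, future
            return
        for future in as_completed(futures):
            x = futures[future]
            try:
                yield x, future.result()
            except Exception as e:
                if not allow_failure:
                    raise
                # Assumed behavior: warn and drop the failed input.
                warnings.warn(f"Failed to fetch {x}: {e}")


def flaky(x):
    # Hypothetical fetcher that fails for a single input.
    if x == 2:
        raise ValueError("boom")
    return x * 10


with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fetched = sorted(fetch_iterable_sketch([1, 2, 3], flaky))

pending = dict(fetch_iterable_sketch([1, 3], flaky, blocking=False))
```

In blocking mode the results arrive in completion order, which is why the example sorts them before use.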

lXtractor.util.io.fetch_text(url, decode=False, chunk_size=8192, **kwargs)[source]

Fetch the content as a single string. This will use the requests.get with stream=True by default to split the download into chunks and thus avoid taking too much memory at once.

Parameters:

url (str) – Link to fetch from.
decode (bool) – Decode the received bytes to utf-8.
chunk_size (int) – The number of bytes to use when splitting the fetched result into chunks.
kwargs – Passed to requests.get().

Returns:

Fetched text as a single string.

Return type:

str | bytes

lXtractor.util.io.fetch_to_file(url, fpath=None, fname=None, root_dir=None, decode=False)[source]

Parameters:

url (str) – Link to a file.
fpath (Path | None) – Path to a file for saving. If provided, fname and root_dir are ignored. Otherwise, will use .../{this} from the link for the file name and save into the current dir.
fname (str | None) – Name of the file to save.
root_dir (Path | None) – Dir where to save the file.
decode (bool) – If True, try decoding the raw request’s content.

Returns:

Local path to the file.

Return type:

Path

lXtractor.util.io.fetch_urls(url_getter, url_getter_args, fmt, dir_, *, fname_idx=0, args_applier=None, callback=None, overwrite=False, decode=False, max_trials=1, num_threads=None, verbose=False)[source]

A general-purpose function for fetching URLs. Each URL is dynamically produced via URL getters supplied with positional arguments.

lXtractor.util.misc module

Miscellaneous utilities that couldn’t be properly categorized.

lXtractor.util.misc.all_logging_disabled(highest_level=50)[source]

A context manager that will prevent any logging messages triggered during the body from being processed.

The function was borrowed from this gist

Parameters:

highest_level – The maximum logging level in use. This only needs to be changed if a custom level greater than CRITICAL is defined.
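A context manager of this kind is commonly built on logging.disable. The sketch below shows that pattern; it is an assumption about how such a manager can work rather than a copy of the lXtractor source, and all_logging_disabled_sketch is a hypothetical name.

```python
import logging
from contextlib import contextmanager


@contextmanager
def all_logging_disabled_sketch(highest_level=logging.CRITICAL):
    """Suppress log records at or below `highest_level` inside the block."""
    previous_level = logging.root.manager.disable
    logging.disable(highest_level)
    try:
        yield
    finally:
        # Restore whatever disable level was active before entering.
        logging.disable(previous_level)


logger = logging.getLogger("demo")
with all_logging_disabled_sketch():
    enabled_inside = logger.isEnabledFor(logging.CRITICAL)
enabled_after = logger.isEnabledFor(logging.CRITICAL)
```

Restoring the previous level in a finally clause keeps logging intact even if the body raises.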

lXtractor.util.misc.apply(fn, it, verbose, desc, num_proc, total=None, use_joblib=False, **kwargs)[source]

Parameters:

fn (Callable[[T], R]) – A one-argument function.
it (Iterable[T]) – An iterable over some objects.
verbose (bool) – Display progress bar.
desc (str) – Progress bar description.
num_proc (int) – The number of processes to use. Anything below 1 indicates sequential processing. Otherwise, will apply fn in parallel using ProcessPoolExecutor.
total (int | None) – The total number of elements. Used for the progress bar.
use_joblib (bool) – Use joblib.Parallel for parallel application.
kwargs – Passed to ProcessPoolExecutor.map() or joblib.Parallel.

Returns:

An iterator over the results of applying fn to the elements of it.

Return type:

Iterator[R]

lXtractor.util.misc.col2col(df, col_fr, col_to)[source]

Parameters:

df (DataFrame) – Some DataFrame.
col_fr (str) – A column name to map from.
col_to (str) – A column name to map to.

Returns:

Mapping between values of a pair of columns.

lXtractor.util.misc.get_cpu_count(c)[source]

lXtractor.util.misc.graph_reindex_nodes(g)[source]

Reindex the graph nodes so that node data equals node indices.

Parameters:

g (PyGraph) – An arbitrary PyGraph.

Returns:

A PyGraph of the same size and having the same edges but with reindexed nodes.

Return type:

PyGraph

lXtractor.util.misc.is_empty(x)[source]

Return type:

bool

lXtractor.util.misc.is_valid_field_name(s)[source]

Parameters:

s (str) – Some string.

Returns:

True if s is a valid field name for __getattr__ operations, else False.

Return type:

bool
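One plausible way to implement such a check is to require a Python identifier that is not a reserved keyword. This is an assumption about the criteria, not the documented behavior, and is_valid_field_name_sketch is a hypothetical name.

```python
import keyword


def is_valid_field_name_sketch(s: str) -> bool:
    """Return True if `s` can be used as an attribute name (assumed criteria)."""
    # A usable attribute name is an identifier that is not a reserved keyword.
    return s.isidentifier() and not keyword.iskeyword(s)


checks = {name: is_valid_field_name_sketch(name)
          for name in ("chain_id", "1abc", "class", "a-b")}
```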

lXtractor.util.misc.json_to_molgraph(inp)[source]

Converts a JSON-formatted molecular graph into a PyGraph object. This graph is a dictionary with two keys: “num_nodes” and “edges”. The former indicates the number of atoms in a structure, whereas the latter is a list of edge tuples.

Parameters:

inp (dict | PathLike) – A dictionary or a path to a JSON file produced using rustworkx.node_link_json.

Returns:

A graph with nodes and edges initialized in order given in inp. Any associated data will be omitted.

Return type:

PyGraph

lXtractor.util.misc.valgroup(m, sep=':')[source]

Reformat a mapping from the format:

X => [Y{sep}Z, ...]

To a format:

X => [(Y, [Z, ...]), ...]

>>> mapping = {'X': ['C:A', 'C:B', 'Y:Z']}
>>> valgroup(mapping)
{'X': [('C', ['A', 'B']), ('Y', ['Z'])]}

Hint

This method is useful for converting the sequence-to-structure mapping outputted by lXtractor.ext.sifts.SIFTS to a format accepted by lXtractor.core.chain.initializer.ChainInitializer.from_mapping() to initialize lXtractor.core.chain.Chain objects.

Parameters:

m (Mapping[str, list[str]]) – A mapping from strings to a list of strings.
sep (str) – A separator of each mapped string in the list.

Returns:

A reformatted mapping.
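The regrouping described above can be sketched in a few lines of plain Python. valgroup_sketch is a hypothetical re-implementation written for illustration; it assumes each value contains the separator and that groups keep the order in which their prefix first appears.

```python
def valgroup_sketch(m, sep=":"):
    """Regroup X => [Y{sep}Z, ...] into X => [(Y, [Z, ...]), ...]."""
    out = {}
    for key, values in m.items():
        groups = {}
        for value in values:
            # Split only on the first separator: prefix is Y, the rest is Z.
            head, tail = value.split(sep, 1)
            groups.setdefault(head, []).append(tail)
        out[key] = list(groups.items())
    return out


regrouped = valgroup_sketch({"X": ["C:A", "C:B", "Y:Z"]})
```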

lXtractor.util.seq module

Low-level utilities to work with sequences (as strings) or sequence files.

lXtractor.util.seq.biotite_align(seqs, **kwargs)[source]

Align two sequences using biotite's align_optimal function.

Parameters:

seqs (Iterable[tuple[str, str]]) – An iterable with exactly two sequences.
kwargs – Additional arguments to align_optimal.

Returns:

A pair of aligned sequences.

Return type:

tuple[tuple[str, str], tuple[str, str]]

lXtractor.util.seq.mafft_add(msa, seqs, *, mafft='mafft', thread=1, keeplength=True)[source]

Add sequences to existing MSA using mafft.

This is a curried function: an incomplete argument set yields a partially evaluated function (e.g., mafft_add(thread=10)).

Parameters:

msa (Iterable[tuple[str, str]] | Path) – An iterable over sequences with the same length.
seqs (Iterable[tuple[str, str]]) – An iterable over sequences comprising the addition.
thread (int) – How many threads to dedicate for mafft.
keeplength (bool) – Force to preserve the MSA’s length.
mafft (str) – mafft executable.

Returns:

A tuple of two lists of SeqRecord objects: with (1) alignment sequences with addition, and (2) aligned addition, separately.

Return type:

Iterator[tuple[str, str]]

lXtractor.util.seq.mafft_align(seqs, *, mafft='mafft-linsi', thread=1)[source]

Align an arbitrary number of sequences using mafft.

Parameters:

seqs (Iterable[tuple[str, str]] | Path) – An iterable over (header, seq) pairs or a path to a file with sequences to align.
thread (int) – How many threads to dedicate for mafft.
mafft (str) – mafft executable (path or env variable).

Returns:

An iterator over aligned (header, seq) pairs.

Return type:

Iterator[tuple[str, str]]

lXtractor.util.seq.map_pairs_numbering(s1, s1_numbering, s2, s2_numbering, align=True, align_method=<function mafft_align>, empty=None, **kwargs)[source]

Map numbering between a pair of sequences.

Parameters:

s1 (str) – The first sequence.
s1_numbering (Iterable[int]) – The first sequence’s numbering.
s2 (str) – The second sequence.
s2_numbering (Iterable[int]) – The second sequence’s numbering.
align (bool) – Align before calculating. If False, sequences are assumed to be aligned.
align_method (AlignMethod) – Align method to use. Must be a callable accepting and returning a list of sequences.
empty (Any | None) – Empty numeration element in place of a gap.
kwargs – Passed to align_method.

Returns:

Iterator over character pairs (a, b), where a and b are the original sequences’ numberings. One of a or b in a pair can be empty to represent a gap.

Return type:

Generator[tuple[int | None, int | None], None, None]
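For the case where the sequences are already aligned (align=False), the mapping can be sketched as a single pass over the aligned columns. This is an illustrative sketch, not the library's code: map_aligned_numbering is a hypothetical name, and "-" as the gap character is an assumption.

```python
def map_aligned_numbering(s1_aligned, s1_numbering, s2_aligned, s2_numbering,
                          gap="-", empty=None):
    """Yield numbering pairs for two pre-aligned sequences of equal length."""
    if len(s1_aligned) != len(s2_aligned):
        raise ValueError("Aligned sequences must have equal length")
    num1, num2 = iter(s1_numbering), iter(s2_numbering)
    for c1, c2 in zip(s1_aligned, s2_aligned):
        # A gap consumes no numbering element and is represented by `empty`.
        n1 = empty if c1 == gap else next(num1)
        n2 = empty if c2 == gap else next(num2)
        yield n1, n2


pairs = list(map_aligned_numbering("AC-DE", [10, 11, 12, 13],
                                   "A-GDE", [1, 2, 3, 4]))
```

Each numbering iterable must contain exactly one element per non-gap character of its sequence.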

lXtractor.util.seq.partition_gap_sequences(seqs, max_fraction_of_gaps=1.0)[source]

Removes sequences having fraction of gaps above the given threshold.

Parameters:

seqs (Iterable[tuple[str, str]]) – A collection of arbitrary sequences.
max_fraction_of_gaps (float) – A threshold specifying an upper bound on allowed fraction of gap characters within a sequence.

Returns:

A filtered list of sequences.

Return type:

tuple[Iterator[str], Iterator[str]]

lXtractor.util.seq.read_fasta(inp, strip_id=True)[source]

Simple lazy fasta reader.

Parameters:

inp (str | PathLike | TextIOBase | Iterable[str]) – A path-like object compatible with open, an opened file, an iterable over lines, or raw text as str.
strip_id (bool) – Strip ID to the first consecutive (spaceless) string.

Returns:

An iterator of (header, seq) pairs.

Return type:

Iterator[tuple[str, str]]
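A lazy FASTA reader can be sketched as a generator over lines. The version below only covers the "iterable over lines" input type and is not the library's implementation; read_fasta_sketch is a hypothetical name.

```python
def read_fasta_sketch(lines, strip_id=True):
    """Lazily yield (header, seq) pairs from an iterable over FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
            if strip_id:
                parts = header.split()
                header = parts[0] if parts else ""
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


text = """>sp|P12345 some protein
MKV
LLA
>seq2
GG
"""
records = list(read_fasta_sketch(text.splitlines()))
```

Because the function is a generator, records are produced one at a time and a large file never has to be held in memory as a whole.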

lXtractor.util.seq.remove_gap_columns(seqs, max_gaps=1.0)[source]

Remove gap columns from a collection of sequences.

Parameters:

seqs (Iterable[str]) – A collection of equal length sequences.
max_gaps (float) – Max fraction of gaps allowed per column.

Returns:

Initial seqs with gap columns removed and removed columns’ indices.

Return type:

tuple[Iterator[str], ndarray]

lXtractor.util.seq.write_fasta(inp, out)[source]

Simple fasta writer.

Parameters:

inp (Iterable[tuple[str, str]]) – Iterable over (header, seq) pairs.
out (Path | SupportsWrite) – Something that supports .write method.

Returns:

Nothing.

Return type:

None

lXtractor.util.structure module

Low-level utilities to work with structures.

lXtractor.util.structure.calculate_dihedral(atom1, atom2, atom3, atom4)[source]

Calculate angle between planes formed by [atom1, atom2, atom3] and [atom2, atom3, atom4].

Each atom is an array of shape (3, ) with XYZ coordinates.

Calculation method inspired by https://math.stackexchange.com/questions/47059/how-do-i-calculate-a-dihedral-angle-given-cartesian-coordinates

Return type:

float
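The dihedral angle between the two planes can be computed with the standard atan2 formula over the three bond vectors. The sketch below uses plain tuples instead of arrays and returns radians; dihedral_sketch and its helpers are hypothetical names, and the unit and sign convention of the lXtractor function are not stated here, so treat both as assumptions.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_sketch(atom1, atom2, atom3, atom4):
    """Signed dihedral angle (radians) defined by four XYZ points."""
    b1, b2, b3 = _sub(atom2, atom1), _sub(atom3, atom2), _sub(atom4, atom3)
    # Normals to the planes [atom1, atom2, atom3] and [atom2, atom3, atom4].
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.atan2(y, x)


trans = dihedral_sketch((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0))
cis = dihedral_sketch((1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 0))
perpendicular = dihedral_sketch((1, 0, 0), (0, 0, 0), (0, 1, 0), (0, 1, 1))
```

Using atan2 rather than acos keeps the sign of the angle and avoids precision loss near 0 and 180 degrees.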

lXtractor.util.structure.compare_arrays(a, b, eps=0.001)[source]

Compare two numerical arrays.

Parameters:

a (ndarray[Any, dtype[float | int]]) – The first array.
b (ndarray[Any, dtype[float | int]]) – The second array.
eps (float) – Comparison tolerance.

Returns:

True if the absolute difference between the two arrays is within eps.

Raises:

LengthMismatch – If the two arrays are not of the same shape.

lXtractor.util.structure.compare_coord(a, b, eps=0.001)[source]

Compare coordinates between atoms of two atom arrays.

Parameters:

a (AtomArray) – The first atom array.
b (AtomArray) – The second atom array.
eps (float) – Comparison tolerance.

Returns:

True if the two arrays are of the same length and the absolute difference between coordinates of the corresponding atom pairs is within eps.

lXtractor.util.structure.extend_residue_mask(a, idx)[source]

Extend a residue mask for given atoms.

Parameters:

a (AtomArray) – An arbitrary atom array.
idx (list[int]) – Indices pointing to atoms at which to extend the mask.

Returns:

The extended mask, where True indicates that the atom belongs to the same residue as indicated by idx.

Return type:

ndarray[Any, dtype[bool_]]

lXtractor.util.structure.filter_any_polymer(a, min_size=2)[source]

Get a mask indicating atoms being a part of a macromolecular polymer: peptide, nucleotide, or carbohydrate.

Parameters:

a (AtomArray) – Array of atoms.
min_size (int) – Min number of polymer monomers.

Returns:

A boolean mask True for polymers’ atoms.

Return type:

ndarray

lXtractor.util.structure.filter_ligand(a)[source]

Filter for ligand atoms – non-polymer and non-solvent hetero atoms.

Note

No contact-based verification is performed here.

Parameters:

a (AtomArray) – Atom array.

Returns:

A boolean mask True for ligand atoms.

Return type:

ndarray

lXtractor.util.structure.filter_polymer(a, min_size=2, pol_type='peptide')[source]

Filter for atoms that are a part of a consecutive standard macromolecular polymer entity.

Parameters:

a (AtomArray) – The array to filter.
min_size – The minimum number of monomers.
pol_type – The polymer type, either "peptide", "nucleotide", or "carbohydrate". Abbreviations are supported: "p", "pep", "n", etc.

Returns:

A boolean mask that is True for atoms belonging to a consecutive polymer entity having at least min_size monomers.

Return type:

ndarray[Any, dtype[bool_]]

lXtractor.util.structure.filter_selection(array, res_id, atom_names=None)[source]

Filter AtomArray by residue numbers and atom names.

Parameters:

array (AtomArray) – Arbitrary structure.
res_id (Sequence[int] | None) – A sequence of residue numbers.
atom_names (Sequence[Sequence[str]] | Sequence[str] | None) – A sequence of atom names (broadcasted to each position in res_id) or an iterable over such sequences for each position in res_id.

Returns:

A binary mask that is True for filtered atoms.

Return type:

ndarray

lXtractor.util.structure.filter_solvent_extended(a)[source]

Filter for solvent atoms using a curated solvent list including non-water molecules typically being a part of a crystallization solution.

Parameters:

a (AtomArray) – Atom array.

Returns:

A boolean mask True for solvent atoms.

Return type:

ndarray

lXtractor.util.structure.filter_to_common_atoms(a1, a2, allow_residue_mismatch=False)[source]

Filter to atoms common between residues of atom arrays a1 and a2.

Parameters:

a1 (AtomArray) – Arbitrary atom array.
a2 (AtomArray) – Arbitrary atom array.
allow_residue_mismatch (bool) – If True, when residue names mismatch, the common atoms are derived from the intersection a1.atoms & a2.atoms & {"C", "N", "CA", "CB"}.

Returns:

A pair of masks for a1 and a2, True for matching atoms.

Raises:

ValueError –

If a1 and a2 have a different number of residues.
If the selection for some residue produces a different number of atoms.

Return type:

tuple[ndarray, ndarray]

lXtractor.util.structure.find_contacts(a, mask)[source]

Find contacts between a subset of atoms within the structure and the rest of the structure. An atom is considered to be in contact with another atom if the distance between them is below the threshold for the non-covalent bond specified in config (DefaultConfig["bonds"]["NC-NC"][1]).

Parameters:

a (AtomArray) – Atom array.
mask (ndarray) – A boolean mask True for atoms for which to find contacts.

Returns:

A tuple with three arrays of size equal to the a’s number of atoms:

Contact mask: True for a[~mask] atoms in contact with a[mask].
Distances: for a[mask] atoms to the closest a[~mask] atom.
Indices: of these closest a[~mask] atoms within the mask.

Suppose that mask specifies a ligand. Then, for i-th atom in a, contacts[i], distances[i], indices[i] indicate whether a[i] has a contact, the precise distance from a[i] atom to the closest ligand atom, and an index of this ligand atom, respectively.

Return type:

tuple[ndarray, ndarray, ndarray]
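The ligand example above can be sketched with plain coordinate tuples. This is an illustration under assumptions, not the library's code: find_contacts_sketch is a hypothetical name, the threshold is passed explicitly instead of being read from the config, masked atoms are never flagged as contacts, and the returned indices refer to positions in the full coordinate list.

```python
import math


def find_contacts_sketch(coords, mask, threshold):
    """Return (contacts, distances, indices), one entry per atom in `coords`."""
    selected = [j for j, m in enumerate(mask) if m]
    if not selected:
        raise ValueError("mask selects no atoms")
    contacts, distances, indices = [], [], []
    for i, xyz in enumerate(coords):
        # Closest masked (e.g., ligand) atom to atom i.
        j_min = min(selected, key=lambda j: math.dist(xyz, coords[j]))
        d_min = math.dist(xyz, coords[j_min])
        contacts.append(not mask[i] and d_min < threshold)
        distances.append(d_min)
        indices.append(j_min)
    return contacts, distances, indices


coords = [(0, 0, 0), (3, 0, 0), (10, 0, 0), (4, 0, 0)]
mask = [False, False, False, True]  # the last atom plays the role of a ligand
# 4.5 is an arbitrary illustrative threshold, not the configured value.
contacts, distances, indices = find_contacts_sketch(coords, mask, threshold=4.5)
```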

lXtractor.util.structure.find_first_polymer_type(a, min_size=2, order=('p', 'n', 'c'))[source]

Determines polymer type of the supplied atom array or an array of atom marks.

Probe polymer types in a sequence in a given order. If a polymer with at least min_size atoms of the probed type is found, it will be returned.

Hint

The function serves as a good quick-check when a single polymer type is expected, which should always be true when a is an array of atom marks.

Parameters:

a (AtomArray | ndarray[Any, dtype[int]]) – An arbitrary array of atoms.
min_size (int) – A minimum number of monomers in a polymer.
order (tuple[str, str, str]) – An order of the polymers to probe.

Returns:

The first polymer type to accommodate min_size requirement.

Return type:

str

lXtractor.util.structure.find_primary_polymer_type(a, min_size=2, residues=False)[source]

Find the major polymer type, i.e., the one with the largest number of atoms or monomers.

Parameters:

a (AtomArray) – An arbitrary atom array.
min_size (int) – Minimum number of monomers for a polymer.
residues (bool) – True if the dominant polymer should be picked according to the number of residues. Otherwise, the number of atoms will be used.

Returns:

A binary mask pointing at the polymer atoms in a and the polymer type – “c” (carbohydrate), “n” (nucleotide), or “p” (peptide). If no polymer atoms were found, polymer type will be designated as “x”.

Return type:

tuple[ndarray, str]

lXtractor.util.structure.get_missing_atoms(a, excluding_names=('OXT',), excluding_elements=('H',))[source]

For each residue, compare with the one stored in CCD, and find missing atoms.

Parameters:

a (AtomArray) – Non-empty atom array.
excluding_names (Sequence[str] | None) – A sequence of atom names to exclude for calculation.
excluding_elements (Sequence[str] | None) – A sequence of element names to exclude for calculation.

Returns:

A generator of lists of missing atoms (excluding hydrogens) per residue in a or None if no such residue was found in CCD.

Return type:

Generator[list[str | None] | None, None, None]

lXtractor.util.structure.get_observed_atoms_frac(a, excluding_names=('OXT',), excluding_elements=('H',))[source]

Find fractions of observed atoms compared to canonical residue versions stored in CCD.

Parameters:

a (AtomArray) – Non-empty atom array.
excluding_names (Sequence[str] | None) – A sequence of atom names to exclude for calculation.
excluding_elements (Sequence[str] | None) – A sequence of element names to exclude for calculation.

Returns:

A generator of observed atom fractions per residue in a or None if a residue was not found in CCD.

Return type:

Generator[list[str | None] | None, None, None]

lXtractor.util.structure.iter_canonical(a)[source]

Parameters:

a (AtomArray) – Arbitrary atom array.

Returns:

A generator of canonical versions of residues in a or None if no such residue is found in CCD.

Return type:

Generator[AtomArray | None, None, None]

lXtractor.util.structure.iter_residue_masks(a)[source]

Iterate over residue masks.

Parameters:

a (AtomArray) – Atom array.

Returns:

A generator over boolean masks for each residue in a.

Return type:

Generator[ndarray[Any, dtype[bool_]], None, None]

lXtractor.util.structure.load_structure(inp, fmt='', *, gz=False, **kwargs)[source]

This is a simplified version of biotite.io.general.load_structure that extends the supported input types. Namely, it allows using paths, strings, bytes or gzipped files. On the other hand, fewer formats are supported: pdb, cif, and mmtf.

Parameters:

inp (IOBase | Path | str | bytes) – Input to load from. It can be a path to a file, an opened file handle, a string or bytes of file contents. Gzipped bytes and files are supported.
fmt (str) – If inp is a Path-like object, it must be of the form “name.fmt” or “name.fmt.gz”. In this case, fmt is ignored. Otherwise, it is used to determine the parser type and must be provided.
gz (bool) – If inp is gzipped bytes, this flag must be True.
kwargs – Passed to get_structure: either a method or a separate function used by biotite to convert the input into an AtomArray.

Returns:

Return type:

AtomArray

lXtractor.util.structure.mark_polymer_type(a, min_size=2)[source]

Denote polymer type in an atom array.

It will find the breakpoints in a and split it into segments. Each segment will be checked separately to determine its polymer type. The results are then concatenated into a single array and returned.

Parameters:

a (AtomArray) – Any atom array.
min_size (int) – Minimum number of consecutive monomers in a polymer.

Returns:

An array where each atom of a is marked by a character: "n", "p", or "c" for nucleotide, peptide, and carbohydrate. Non-polymer atoms are marked by “x”.

Return type:

ndarray[Any, dtype[str_]]

lXtractor.util.structure.save_structure(array, path, **kwargs)[source]

This is a simplified version of biotite.io.general.save_structure. On the one hand, it can conveniently compress the data using gzip. On the other hand, fewer formats are supported: pdb, cif, and mmtf.

Parameters:

array (AtomArray) – An AtomArray to write.
path (Path) – A path with correct extension, e.g., Path("data/structure.pdb"), or Path("data/structure.pdb.gz").
kwargs – If compressing is not required, the original save_structure from biotite is used with these kwargs. Otherwise, kwargs are ignored.

Returns:

If the file was successfully written, returns the original path.

lXtractor.util.structure.to_graph(a, split_chains=False)[source]

Create a molecular connectivity graph from an atom array.

A molecular graph is an undirected graph without multiedges, where nodes are indices to atoms. Thus, node indices point directly to atoms in the provided atom array, and the number of nodes equals the number of atoms. A pair of nodes has an edge between them if they form a covalent bond. The edges are constructed according to atom-dependent bond thresholds defined by the global config. These distances are stored as edge values. See the docs of rustworkx on how to manipulate the resulting graph object.

Parameters:

a (AtomArray) – Atom array to build a graph from.
split_chains (bool) – If True, edges between atoms from different chains are forbidden.

Returns:

A graph object where nodes are atom indices and edges represent covalent bonds.

Return type:

PyGraph
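The idea of a distance-based connectivity graph can be sketched without rustworkx. The version below is a simplification under stated assumptions: to_graph_sketch is a hypothetical name, a single max_bond_length replaces the atom-dependent thresholds from the global config, and the graph is returned as a dictionary mapping (i, j) index pairs to distances rather than as a PyGraph.

```python
import math
from itertools import combinations


def to_graph_sketch(coords, chain_ids, max_bond_length, split_chains=False):
    """Map (i, j) atom index pairs to their distance if they are 'bonded'."""
    edges = {}
    for i, j in combinations(range(len(coords)), 2):
        if split_chains and chain_ids[i] != chain_ids[j]:
            # Edges between different chains are forbidden.
            continue
        d = math.dist(coords[i], coords[j])
        if d <= max_bond_length:
            edges[(i, j)] = d  # the distance is stored as the edge value
    return edges


coords = [(0, 0, 0), (1.5, 0, 0), (3.0, 0, 0), (10, 0, 0)]
chain_ids = ["A", "A", "B", "B"]
# 1.6 is an arbitrary illustrative threshold, not a configured value.
all_edges = to_graph_sketch(coords, chain_ids, max_bond_length=1.6)
intra_chain = to_graph_sketch(coords, chain_ids, 1.6, split_chains=True)
```

The all-pairs loop is quadratic in the number of atoms; it is meant to show the edge criterion, not to scale to full structures.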