lXtractor.chain package

lXtractor.chain.base module

lXtractor.chain.base.is_chain_type(s)[source]
Return type:

t.TypeGuard[CTU]

lXtractor.chain.base.is_chain_type_iterable(s)[source]
Return type:

t.TypeGuard[abc.Iterable[Chain] | abc.Iterable[ChainSequence] | abc.Iterable[ChainStructure]]

lXtractor.chain.base.topo_iter(start_obj, iterator)[source]

Iterate over sequences in topological order.

>>> n = 1
>>> it = topo_iter(n, lambda x: (x + 1 for n in range(x)))
>>> next(it)
[2]
>>> next(it)
[3, 3]
Parameters:
  • start_obj (T) – Starting object.

  • iterator (Callable[[T], Iterable[T]]) – A callable accepting a single argument of the same type as the start_obj and returning an iterator over objects with the same type, representing the next level.

Returns:

A generator yielding lists of objects obtained using iterator and representing topological levels with the root in start_obj.

Return type:

Generator[list[T], None, None]
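
The level-by-level traversal described above can be sketched in plain Python. This is a minimal illustration, not lXtractor's implementation: topo_iter_sketch and the tree mapping are hypothetical names, and the sketch assumes iteration stops once a level comes back empty.

```python
from collections.abc import Callable, Generator, Iterable
from itertools import islice
from typing import TypeVar

T = TypeVar("T")


def topo_iter_sketch(
    start_obj: T, iterator: Callable[[T], Iterable[T]]
) -> Generator[list[T], None, None]:
    """Yield successive levels reachable from ``start_obj`` (root excluded)."""
    level = list(iterator(start_obj))
    while level:
        yield level
        # The next level is everything reachable from the current one.
        level = [child for obj in level for child in iterator(obj)]


# A hypothetical finite tree stored as an adjacency mapping.
tree = {"root": ["a", "b"], "a": ["a1"], "b": ["b1", "b2"]}
levels = list(topo_iter_sketch("root", lambda node: tree.get(node, [])))
print(levels)

# The infinite example from the docstring, truncated to two levels.
first_two = list(
    islice(topo_iter_sketch(1, lambda x: (x + 1 for _ in range(x))), 2)
)
print(first_two)
```

Each yielded list is one topological level, so a caller can process parents before children without recursion.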

lXtractor.chain.sequence module

class lXtractor.chain.sequence.ChainSequence(start, end, name='S', seqs=None, parent=None, children=None, meta=None, variables=None)[source]

Bases: Segment

A class representing a polymeric sequence of a single entity (chain).

The sequences are stored internally as a dictionary {seq_name => _seq} and must all have the same length. Additionally, seq_name must be a valid field name: something one could use in namedtuples. If unsure, please use lXtractor.util.misc.is_valid_field_name() for testing.

A single gap-less primary sequence (seq1()) is mandatory during the initialization. We refer to the sequences other than seq1() as “maps.” To view the standard sequence names supported by ChainSequence, use the field_names() property.

The sequence can be a part of a larger one. The child-parent relationships are indicated via parent and children, where the latter entails any sub-sequence. A preferable way to create subsequences is the spawn_child() method.

>>> seqs = {
...     'seq1': 'A' * 10,
...     'A': ['A', 'N', 'Y', 'T', 'H', 'I', 'N', 'G', '!', '?']
... }
>>> cs = ChainSequence(1, 10, 'CS', seqs=seqs)
>>> cs
CS|1-10
>>> assert len(cs) == 10
>>> assert 'A' in cs and 'seq1' in cs
>>> assert cs.seq1 == 'A' * 10
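
The two constraints on stored sequences (valid field names and equal lengths) can be checked up front. The sketch below applies the rules that collections.namedtuple enforces for field names. Whether lXtractor.util.misc.is_valid_field_name() applies exactly these rules is an assumption, and both helper names are hypothetical.

```python
import keyword


def is_namedtuple_field_name(name: str) -> bool:
    """Check the rules ``collections.namedtuple`` enforces for field names."""
    return (
        name.isidentifier()
        and not keyword.iskeyword(name)
        and not name.startswith("_")
    )


def check_seqs(seqs: dict) -> None:
    """Raise ValueError if a name is invalid or the lengths differ."""
    bad = [name for name in seqs if not is_namedtuple_field_name(name)]
    if bad:
        raise ValueError(f"Invalid sequence names: {bad}")
    if len({len(seq) for seq in seqs.values()}) > 1:
        raise ValueError("All sequences must have the same length")


# The same input as in the example above passes both checks.
check_seqs({"seq1": "A" * 10, "A": list("ANYTHING!?")})
```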
apply_children(fn, inplace=False)[source]

Apply some function to children.

Parameters:
  • fn (ApplyT[ChainSequence]) – A callable accepting and returning the chain sequence type instance.

  • inplace (bool) – Apply to children in place. Otherwise, return a copy with only children transformed.

Returns:

A chain sequence with transformed children.

Return type:

t.Self

apply_to_map(map_name, fn, inplace=False, preserve_children=False, apply_to_children=False)[source]

Apply some function to map/sequence in this chain sequence.

Parameters:
  • map_name (str) – Name of the internal sequence/map.

  • fn (ApplyT[abc.Sequence]) – A function accepting and returning a sequence of the same length.

  • inplace (bool) – Apply the operation to this object. Otherwise, create a copy with the transformed sequence.

  • preserve_children (bool) – Preserve children of this instance in the transformed object. Passing True makes sense if the target sequence is mutable: the children will be transformed naturally. If the target sequence is immutable, consider passing True with apply_to_children=True.

  • apply_to_children (bool) – Recursively apply the same fn to a child tree starting from this instance. If passed, sets preserve_children=True: otherwise, one is at risk of removing all children in the child tree of the returned instance.

Returns:

Return type:

t.Self

as_chain(transfer_children=True, structures=None, **kwargs)[source]

Convert this chain sequence to chain.

Note

Pass add_to_children=True to transfer structure to each child if transfer_children=True.

Parameters:
  • transfer_children (bool) – Transfer existing children.

  • structures (abc.Sequence[ChainStructure] | None) – Add structures to the created chain.

  • kwargs – Passed to Chain.add_structure

Returns:

Return type:

Chain

as_df()[source]
Returns:

The pandas DataFrame representation of the sequence where each column corresponds to a sequence or map.

Return type:

DataFrame

as_np()[source]
Returns:

The numpy representation of a sequence as matrix. This is a shortcut to as_df() and getting df.values.

Return type:

ndarray

coverage(map_names=None, save=True, prefix='cov')[source]

Calculate maps’ coverage, i.e., the number of non-empty elements.

Parameters:
  • map_names (Sequence[str] | None) – optionally, provide the sequence of map names to calculate the coverage for.

  • save (bool) – save the results to meta

  • prefix (str) – if save is True, format keys f”{prefix}_{name}” for the meta dictionary.

Returns:

Return type:

dict[str, float]

fill(other, template, target, link_name, link_points_to, keep=True, target_new_name=None, empty_template=(None, ), empty_target=(None, ), transform=<function identity>)[source]

Fill-in a sequence in other using a template sequence from here.

As an example, consider two related sequences, s and o, mapped to the same reference numbering scheme r, which we’ll denote as a “link sequence.”

We would like to fill in “X” residues within o with residues from s. Let’s first try this:

>>> s = ChainSequence.from_string('ABCD', r=[10, 11, 12, 13])
>>> o = ChainSequence.from_string('AABXDE', r=[9, 10, 11, 12, 13, 14])
>>> s.fill(o,'seq1','seq1','r','r')
['A', 'A', 'B', 'X', 'D', 'E']

In the example above, “X” was not replaced because it’s not considered an “empty” target element requiring replacement. Below, we’ll provide a tuple of possible empty values and pass a transform function that will join the result back into str.

>>> s.fill(o,'seq1','seq1','r','r',empty_target=('X', ),transform="".join)
'AABCDE'
>>> o['seq1_patched'] == 'AABCDE'
True
Parameters:
  • other (t.Self) – Some other chain sequence.

  • template (str) – The name of the template sequence.

  • target (str) – Target sequence name within other to patch.

  • link_name (str) – Name of the map within other that links it with this sequence.

  • link_points_to (str | None) – Name of the map within this chain sequence that corresponds to link_name within other. If None, it is assumed to be the same as link_name.

  • keep (bool) – Keep patched sequence within other.

  • target_new_name (str | None) – Name of the patched sequence to save within other if keep is True. If this or target names are “seq1”, will use “seq1_patched” as target_new_name as this sequence is considered immutable by convention.

  • empty_target (tuple[t.Any, ...] | abc.Callable[[T], bool]) – A tuple of element instances or a callable. If tuple, a target element will be replaced with the corresponding element from template if it’s within this tuple. If callable, should accept an element of the target sequence and output True if it should be replaced with an element from the template and False otherwise.

  • empty_template (tuple[t.Any, ...] | abc.Callable[[T], bool]) – Same as empty_target but applied to a template character, with reverse meaning for True and False of the empty_target param.

  • transform (abc.Callable[[list[T]], abc.Sequence[R]]) – A function that transforms the result from one sequence to another.

Returns:

A patched mapping/sequence after applying the transform function.

Return type:

abc.Sequence[R]
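
The fill-in logic can be illustrated on plain sequences. This is a simplified sketch under stated assumptions, not the library's code: fill_sketch is a hypothetical name, only the tuple form of empty_target and empty_template is handled, and link values are assumed to be unique within the template.

```python
def fill_sketch(
    template,
    template_link,
    target,
    target_link,
    empty_target=(None,),
    empty_template=(None,),
    transform=list,
):
    """Replace "empty" target elements with template elements sharing a link value."""
    by_link = dict(zip(template_link, template))
    patched = []
    for elem, link in zip(target, target_link):
        candidate = by_link.get(link)
        replaceable = (
            elem in empty_target
            and link in by_link
            and candidate not in empty_template
        )
        patched.append(candidate if replaceable else elem)
    return transform(patched)


s_seq, s_r = "ABCD", [10, 11, 12, 13]
o_seq, o_r = "AABXDE", [9, 10, 11, 12, 13, 14]

unchanged = fill_sketch(s_seq, s_r, o_seq, o_r)
filled = fill_sketch(
    s_seq, s_r, o_seq, o_r, empty_target=("X",), transform="".join
)
print(unchanged, filled)
```

As in the doctest above, “X” is only replaced once it is declared an empty target element.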

filter_children(pred, inplace=False)[source]

Filter children using some predicate.

Parameters:
  • pred (FilterT[ChainSequence]) – Some callable accepting chain sequence and returning bool.

  • inplace (bool) – Filter children in place. Otherwise, return a copy with only children transformed.

Returns:

A chain sequence with filtered children.

Return type:

t.Self

classmethod from_df(df, name='S', meta=None)[source]

Init sequence from a data frame.

Parameters:
  • df (Path | pd.DataFrame) – Path to a tsv file or a pandas DataFrame.

  • name (str) – Name of a new chain sequence.

  • meta (dict[str, t.Any] | None) – Meta info of a new chain sequence.

Returns:

Initialized chain sequence.

Return type:

t.Self

classmethod from_file(inp, reader=<function read_fasta>, start=None, end=None, name=None, meta=None, **kwargs)[source]

Initialize chain sequence from file.

Parameters:
  • inp (Path | TextIOBase | Iterable[str]) – Path to a file or file handle or iterable over file lines.

  • reader (SeqReader) – A function to parse the sequence from inp.

  • start (int | None) – Start coordinate of a sequence in a file. If not provided, assumed to be 1.

  • end (int | None) – End coordinate of a sequence in a file. If not provided, will evaluate to the sequence’s length.

  • name (str | None) – Name of a sequence in inp. If not provided, will evaluate to a sequence’s header.

  • meta (dict[str, Any] | None) – Meta-info to add for the sequence.

  • kwargs – Additional sequences other than seq1 (as used during initialization via _seq attribute).

Returns:

Initialized chain sequence.

Return type:

ChainSequence

classmethod from_string(s, start=None, end=None, name='S', meta=None, **kwargs)[source]

Initialize chain sequence from string.

Parameters:
  • s (str) – String to init from.

  • start (int | None) – Start coordinate (default=1).

  • end (int | None) – End coordinate (default=len(s)).

  • name (str) – Name of a new chain sequence.

  • meta (dict[str, Any] | None) – Meta info of a new sequence.

  • kwargs – Additional sequences other than seq1 (as used during initialization via _seq attribute).

Returns:

Initialized chain sequence.

Return type:

ChainSequence

classmethod from_tuple(inp, start=None, end=None, meta=None, **kwargs)[source]
get_closest(key, value, *, reverse=False)[source]

Find the closest item for which item.key >=/<= value. By default, the search starts from the sequence’s beginning, and expands towards the end until the first element for which the retrieved value >= the provided value. If the reverse is True, the search direction is reversed, and the comparison operator becomes <=

>>> s = ChainSequence(1, 4, 'CS', seqs={'seq1': 'ABCD', 'X': [5, 6, 7, 8]})
>>> s.get_closest('seq1', 'D')
Item(i=4, seq1='D', X=8)
>>> s.get_closest('X', 0)
Item(i=1, seq1='A', X=5)
>>> assert s.get_closest('X', 0, reverse=True) is None
Parameters:
  • key (str) – map name.

  • value (Ord) – map value. Must support comparison operators.

  • reverse (bool) – reverse the sequence order and the comparison operator.

Returns:

The first relevant item or None if no relevant items were found.

Return type:

NamedTupleT | None
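
The search direction and comparison rule can be sketched on a bare list of values. This is an illustrative approximation, not the library's implementation: get_closest_sketch is a hypothetical name, and it returns an (index, value) pair instead of a named-tuple item.

```python
def get_closest_sketch(values, value, reverse=False, start=1):
    """Return the first (index, element) with element >= value,
    or, scanning from the end, the first with element <= value."""
    indexed = list(enumerate(values, start=start))
    if reverse:
        indexed.reverse()
    for i, v in indexed:
        if (v <= value) if reverse else (v >= value):
            return i, v
    return None


x_map = [5, 6, 7, 8]
forward = get_closest_sketch(x_map, 0)
backward = get_closest_sketch(x_map, 0, reverse=True)
letter = get_closest_sketch("ABCD", "D")
print(forward, backward, letter)
```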

get_item(key, value)[source]

Get a specific item. Same as get_map(), but uses value to retrieve the needed item immediately.

(!) Use it when a single item is needed. For multiple queries for the same sequence, please use get_map().

>>> s = ChainSequence.from_string('ABC', name='CS')
>>> s.get_item('seq1', 'B').i
2
Parameters:
  • key (str) – map name.

  • value (Any) – sequence value of the sequence under the key name.

Returns:

an item corresponding to the desired sequence element.

Return type:

NamedTupleT

get_map(key, to=None, rm_empty=False)[source]

Obtain the mapping of the form “key->item(seq_name=*,…)”.

>>> s = ChainSequence.from_string('ABC', name='CS')
>>> s.get_map('i')
{1: Item(i=1, seq1='A'), 2: Item(i=2, seq1='B'), 3: Item(i=3, seq1='C')}
>>> s.get_map('seq1')
{'A': Item(i=1, seq1='A'), 'B': Item(i=2, seq1='B'), 'C': Item(i=3, seq1='C')}
>>> s.add_seq('S', [1, 2, np.nan])
>>> s.get_map('seq1', 'S', rm_empty=True)
{'A': 1, 'B': 2}
Parameters:
  • key (str) – A _seq name to map from.

  • to (str | None) – A _seq name to map to.

  • rm_empty (bool) – Remove empty keys and values. A numeric value is empty if it is of type NaN. A string value is empty if it is an empty string ("").

Returns:

dict mapping key values to items.

Return type:

dict[Hashable, Any]
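
A rough picture of how such a mapping could be built from named tuples is given below. The names get_map_sketch and is_empty are hypothetical, and the emptiness rule covers only the two cases named above (NaN and the empty string).

```python
import math
from collections import namedtuple


def is_empty(value) -> bool:
    """NaN and the empty string count as empty."""
    if isinstance(value, float):
        return math.isnan(value)
    return value == ""


def get_map_sketch(seqs, key, to=None, rm_empty=False, start=1):
    """Map values of the ``key`` sequence to whole items or to ``to`` values."""
    Item = namedtuple("Item", ["i", *seqs])
    length = len(next(iter(seqs.values())))
    rows = zip(range(start, start + length), *seqs.values())
    result = {}
    for item in (Item(*row) for row in rows):
        k = getattr(item, key)
        v = item if to is None else getattr(item, to)
        if rm_empty and (is_empty(k) or (to is not None and is_empty(v))):
            continue
        result[k] = v
    return result


by_index = get_map_sketch({"seq1": "ABC"}, "i")
seq_to_s = get_map_sketch(
    {"seq1": "ABC", "S": [1, 2, float("nan")]}, "seq1", "S", rm_empty=True
)
print(by_index[2], seq_to_s)
```

Building the dictionary once and reusing it is what makes get_map() preferable to repeated get_item() calls on the same sequence.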

iter_children()[source]

Iterate over a child tree in topological order.

>>> s = ChainSequence(1, 10, 'CS', seqs={'seq1': 'A' * 10})
>>> ss = s.spawn_child(1, 5, 'CS_')
>>> sss = ss.spawn_child(1, 3, 'CS__')
>>> list(s.iter_children())
[[CS_|1-5<-(CS|1-10)], [CS__|1-3<-(CS_|1-5<-(CS|1-10))]]
Returns:

a generator over child tree levels, starting from the children and expanding such attributes over ChainSequence instances within this attribute.

Return type:

Generator[ChainList[ChainSequence], None, None]

classmethod make_empty(**kwargs)[source]
Returns:

An empty chain sequence.

Return type:

ChainSequence

map_boundaries(start, end, map_name, closest=False)[source]

Map the provided boundaries onto sequence.

A convenient interface for a common task where one wants to find sequence elements corresponding to arbitrary boundaries.

>>> s = ChainSequence.from_string('XXSEQXX', name='CS')
>>> s.add_seq('NCS', list(range(10, 17)))
>>> s.map_boundaries(1, 3, 'i')
(Item(i=1, seq1='X', NCS=10), Item(i=3, seq1='S', NCS=12))
>>> s.map_boundaries(5, 12, 'NCS', closest=True)
(Item(i=1, seq1='X', NCS=10), Item(i=3, seq1='S', NCS=12))
Parameters:
  • start (Ord) – Some orderable object.

  • end (Ord) – Some orderable object.

  • map_name (str) – Use this sequence to search for boundaries. It is assumed that map_name in self is True.

  • closest (bool) – If true, instead of exact mapping, search for the closest elements.

Returns:

a tuple with two items corresponding to mapped start and end.

Return type:

tuple[NamedTupleT, NamedTupleT]

map_numbering(other, align_method=<function mafft_align>, save=True, name='S', **kwargs)[source]

Map the numbering of another sequence onto this one. For this, align primary sequences and relate their numbering.

>>> s = ChainSequence.from_string('XXSEQXX', name='CS')
>>> o = ChainSequence.from_string('SEQ', name='CSO')
>>> s.map_numbering(o)
[None, None, 1, 2, 3, None, None]
>>> assert 'map_CSO' in s
>>> a = Alignment([('CS1', 'XSEQX'), ('CS2', 'XXEQX')])
>>> s.map_numbering(a, name='map_aln')
[None, 1, 2, 3, 4, 5, None]
>>> assert 'map_aln' in s
Parameters:
  • other (str | tuple[str, str] | ChainSequence | Alignment) – another chain _seq.

  • align_method (AlignMethod) – a method to use for alignment.

  • save (bool) – save the numbering as a sequence.

  • name (str) – a name to use if save is True.

  • kwargs – passed to map_pairs_numbering().

Returns:

a list of integers with None indicating gaps.

Return type:

list[None | int]
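
Once the two primary sequences are aligned, relating their numbering is a single pass over the aligned columns. The sketch below assumes the pairwise alignment is already available as two gapped strings of equal length. It does not call any aligner, and numbering_from_alignment is a hypothetical helper.

```python
def numbering_from_alignment(aligned_self, aligned_other, gap="-"):
    """For each non-gap position of ``aligned_self``, give the 1-based
    position of the aligned residue in the other sequence, or None."""
    mapping = []
    pos_other = 0
    for a, b in zip(aligned_self, aligned_other):
        if b != gap:
            pos_other += 1
        if a == gap:
            # A column missing from this sequence contributes no element.
            continue
        mapping.append(pos_other if b != gap else None)
    return mapping


numbering = numbering_from_alignment("XXSEQXX", "--SEQ--")
print(numbering)
```

The result has one entry per element of this sequence, which is why it can be stored as a map of the same length.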

match(map_name1, map_name2, as_fraction=True, save=True, name='auto')[source]
Parameters:
  • map_name1 (str) – Mapping name 1.

  • map_name2 (str) – Mapping name 2.

  • as_fraction (bool) – Divide by the total length.

  • save (bool) – Save the result to meta.

  • name (str) – Name of the saved metadata entry. If “auto”, will derive from given map names.

Returns:

The total number or a fraction of matching characters between maps.

Return type:

float
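
In essence, this is a position-wise comparison of two maps of equal length. A minimal sketch, with the hypothetical name match_sketch and without the saving to meta:

```python
def match_sketch(map1, map2, as_fraction=True):
    """Count positions where two equally long maps hold equal elements."""
    if len(map1) != len(map2):
        raise ValueError("Maps must have the same length")
    n_matching = sum(a == b for a, b in zip(map1, map2))
    if not as_fraction:
        return n_matching
    # Guard against an empty map to avoid dividing by zero.
    return n_matching / len(map1) if len(map1) else 0.0


fraction = match_sketch("ABCD", "ABXD")
count = match_sketch("ABCD", "ABXD", as_fraction=False)
print(fraction, count)
```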

patch(other, numerator, link_name, link_points_to, diff=<built-in function sub>, num_filter=<function ChainSequence.<lambda>>, **kwargs)[source]

Patch the gaps in the provided sequence using this sequence as template.

The existence of a gap is judged by the numerator map that should point to a numeration scheme. If there are two consecutive numerator elements, for which diff returns value greater than one, this is considered a gap that could be filled in by a template.

To relate a potential gap to the template sequence, a link sequence must exist in the provided sequence, containing values referencing the template.

As an example, consider the template sequence “ABCDEG” and the sequence requiring patching “BDEG”. Let e=[1, 4, 6, 7] be the numbering of “BDEG” and r=[2, 4, 5, 6] be a link map that points to the segment indices of the template.

>>> template = ChainSequence.from_string("ABCDEG", name='T')
>>> seq = ChainSequence.from_string("BDEG", name='P', e=[1,4,6,7], r=[2,4,5,6])

Observe that there is a numeration gap between 1 and 4. The corresponding elements of r point to the template indices 2 and 4. Thus, there is a gap that can be filled in by a portion of the template between 2 and 4. Here, it turns out to be a singleton sequence element “C” at position 3. This segment will be inserted into the patched sequence:

>>> patched = template.patch(seq,'e','r','i')
>>> patched.id
'P|1-5'
>>> patched.seq1
'BCDEG'

Similar to patch(), the sequence elements missing in either of the sequences will be filled-in. Thus, what happens to the original numeration e?

>>> patched['e']
[1, None, 4, 6, 7]

On the other hand, the link sequence r can be successfully filled in by the template:

>>> patched['r']
[2, 3, 4, 5, 6]

Note

If this segment is empty or singleton, the other is returned unchanged.

Warning

This operation creates a new segment. The parents and metadata won’t be transferred.

See also

lXtractor.core.segment.Segment.insert() used to insert segments while patching.

Parameters:
  • other (t.Self) – A sequence to patch.

  • numerator (str) – A map name in other containing numeration scheme the gaps will be inferred from.

  • link_name (str) – A map name in other with values referencing some sequence in this instance.

  • link_points_to (str) – A map name in this instance that the link_name refers to in other.

  • diff (abc.Callable[[T, T], int]) – A callable accepting two numerator elements – higher and lower ones – and returning the number of elements between them. By default, a simple subtraction is used.

  • num_filter (abc.Callable[[t.Any], bool]) – An optional filter function to filter out elements in the numerator before splitting it into consecutive pairs. By default, this function will filter out any None values.

  • kwargs – Additional keyword arguments passed to lXtractor.core.segment.Segment.insert().

Returns:

A new patched segment.

Return type:

t.Self
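
The gap-detection rule can be illustrated on plain strings and lists. This is a deliberately simplified sketch: patch_sketch is a hypothetical name, the link is assumed to hold 1-based template indices, None filtering of the numerator is omitted, and only the primary sequence is patched (no maps).

```python
from operator import sub


def patch_sketch(template, seq, numerator, link, diff=sub):
    """Insert template elements wherever consecutive numerator values
    differ by more than one."""
    if not seq:
        return ""
    patched = [seq[0]]
    for k in range(1, len(seq)):
        if diff(numerator[k], numerator[k - 1]) > 1:
            lo, hi = link[k - 1], link[k]
            # 1-based template positions lo+1 .. hi-1 lie strictly
            # between the two linked elements.
            patched.extend(template[lo : hi - 1])
        patched.append(seq[k])
    return "".join(patched)


patched_seq = patch_sketch("ABCDEG", "BDEG", [1, 4, 6, 7], [2, 4, 5, 6])
print(patched_seq)
```

With these inputs the numerator also jumps from 4 to 6, but the linked template indices 4 and 5 are adjacent, so nothing is inserted there.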

classmethod read(base_dir, *, search_children=False)[source]

Initialize chain sequence from dump created using write().

Parameters:
  • base_dir (Path) – A path to a dump dir.

  • search_children (bool) – Recursively search for child segments and populate the children

Returns:

Initialized chain sequence.

Return type:

t.Self

relate(other, map_name, link_name, link_points_to='i', keep=True, map_name_in_other=None)[source]

Relate mapping from this sequence with other via some common “link” sequence.

The “link” sequence is a part of the other pointing to some sequence within this instance.

As an example, consider the case of transferring the mapping to alignment positions aln_map. To do this, the other must be mapped to some sequence within this instance – typically to canonical numbering – via some stored map_canonical sequence.

Thus, one would use:

this.relate(
    other, map_name=aln_map, link_name=map_canonical, link_points_to="i"
)

In the example below, we transfer map_some sequence from s to o via sequence L pointing to the primary sequence of s:

seq1    : A B C D   ---|
map_some: 9 8 7 6      | --> 9 8 None 6 (map transferred to `o`)
          | | | |      |
seq1    : X Y Z R      |
L       : A B X D   ---|
>>> s = ChainSequence.from_string('ABCD', name='CS')
>>> s.add_seq('map_some', [9, 8, 7, 6])
>>> o = ChainSequence.from_string('XYZR', name='XY')
>>> o.add_seq('L', ['A', 'B', 'X', 'D'])
>>> assert 'L' in o
>>> s.relate(o,map_name='map_some', link_name='L', link_points_to='seq1')
[9, 8, None, 6]
>>> assert o['map_some'] == [9, 8, None, 6]
Parameters:
  • other (t.Self) – An arbitrary chain sequence.

  • map_name (str) – The name of the sequence to transfer.

  • link_name (str) – The name of the “link” sequence that connects self and other.

  • link_points_to (str) – Values within this instance the “link” sequence points to.

  • keep (bool) – Store the obtained sequence within the other.

  • map_name_in_other (str | None) – The name of the mapped sequence to store within the other. By default, the map_name is used.

Returns:

The mapped sequence.

Return type:

list[t.Any]
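
Stripped of the ChainSequence machinery, the transfer is a lookup through the link values. The sketch below uses the hypothetical name relate_sketch and assumes the values the link points to are unique.

```python
def relate_sketch(map_values, link_target, link):
    """Transfer ``map_values`` to another sequence via its link.

    ``link_target`` holds the values within this sequence that the link
    points to. Link values without a counterpart map to None.
    """
    lookup = dict(zip(link_target, map_values))
    return [lookup.get(value) for value in link]


transferred = relate_sketch(
    map_values=[9, 8, 7, 6],       # map_some in `s`
    link_target="ABCD",            # seq1 in `s`
    link=["A", "B", "X", "D"],     # L in `o`
)
print(transferred)
```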

rename(name)[source]

Rename this sequence by modifying the name.

Note

This is a mutable operation. Returning a copy of this sequence upon renaming will create two identical sequences with different IDs, which is discouraged.

Parameters:

name (str) – New name.

Returns:

The same sequence with a new name.

Return type:

t.Self

spawn_child(start, end, name=None, category=None, *, map_from=None, map_closest=False, deep_copy=False, keep=True)[source]

Spawn the sub-sequence from the current instance.

Child sequence’s boundaries must be within this sequence’s boundaries.

Uses Segment.sub() method.

>>> s = ChainSequence(
...     1, 4, 'CS',
...     seqs={'seq1': 'ABCD', 'X': [5, 6, 7, 8]}
... )
>>> child1 = s.spawn_child(1, 3, 'Child1')
>>> assert child1.id in s.children
>>> s.children
[Child1|1-3<-(CS|1-4)]
Parameters:
  • start (int) – Start of the sub-sequence.

  • end (int) – End of the sub-sequence.

  • name (str | None) – Spawned child sequence’s name.

  • category (str | None) – Spawned child category. Any meaningful tag string that could be used later to group similar children.

  • map_from (str | None) – Optionally, the map name the boundaries correspond to.

  • map_closest (bool) – Map to closest start, end boundaries (see map_boundaries()).

  • deep_copy (bool) – Deep copy inherited sequences.

  • keep (bool) – Save child sequence within children.

Returns:

Spawned sub-sequence.

Return type:

ChainSequence

summary(meta=True, children=False)[source]
Return type:

DataFrame

write(dest, *, write_children=False)[source]

Dump this chain sequence. Creates sequence.tsv and meta.tsv in dest using write_seq() and write_meta().

Parameters:
  • dest (Path) – Destination directory.

  • write_children (bool) – Recursively write children.

Returns:

Path to the directory where the files are written.

Return type:

Path

write_meta(path, sep='\t')[source]

Write meta information as {key}{sep}{value} lines.

Parameters:
  • path (Path) – Write destination file.

  • sep – Separator between key and value.

Returns:

Nothing.

write_seq(path, fields=None, sep='\t')[source]

Write the sequence (and all its maps) as a table.

Parameters:
  • path (Path) – Write destination file.

  • fields (list[str] | None) – Optionally, names of sequences to dump.

  • sep (str) – Table separator. Please use the default to avoid ambiguities and keep readability.

Returns:

Nothing.

property categories: list[str]
Returns:

A list of categories associated with this object.

Categories are kept under “category” field in meta as a “,”-separated list of strings. For instance, “domain,family_x”.
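
Reading this field back into a list is straightforward. A minimal sketch, assuming meta is a plain dict and that a missing or empty field means no categories (categories_sketch is a hypothetical name):

```python
def categories_sketch(meta: dict) -> list[str]:
    """Split the ","-separated "category" field into a list of strings."""
    raw = meta.get("category", "")
    return [category for category in raw.split(",") if category]


parsed = categories_sketch({"category": "domain,family_x"})
print(parsed)
```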

property fields: tuple[str, ...]
Returns:

Names of the currently stored sequences.

property numbering: Sequence[int]
Returns:

the primary sequence’s (seq1()) numbering.

property seq: t.Self

This property exists for functionality relying on the .seq attribute.

Returns:

This object.

property seq1: str
Returns:

the primary sequence.

property seq3: Sequence[str]
Returns:

the three-letter codes of a primary sequence.

lXtractor.chain.sequence.map_numbering_12many(obj_to_map, seqs, num_proc=1, verbose=False, **kwargs)[source]

Map numbering of a single sequence to many other sequences.

This function does not save mapped numberings.

Parameters:
  • obj_to_map (str | tuple[str, str] | ChainSequence | Alignment) – Object whose numbering should be mapped to seqs.

  • seqs (Iterable[ChainSequence]) – Chain sequences to map the numbering to.

  • num_proc (int) – A number of parallel processes to use.

  • verbose (bool) – Output progress bar.

  • kwargs – Passed to lXtractor.util.misc.apply().

Returns:

An iterator over the mapped numberings.

Return type:

Iterator[list[int | None]]

lXtractor.chain.sequence.map_numbering_many2many(objs_to_map, seq_groups, num_proc=1, verbose=False, **kwargs)[source]

Map numbering of each object o in objs_to_map to each sequence in each group of the seq_groups

o1 -> s1_1 s1_2 s1_3 ...
o2 -> s2_1 s2_2 s2_3 ...
          ...

This function does not save mapped numberings.

For a single object-group pair, it’s the same as map_numbering_12many(). The benefit comes from parallelization of this functionality.

Parameters:
  • objs_to_map (Sequence[str | tuple[str, str] | ChainSequence | Alignment]) – An iterable over objects whose numbering to map.

  • seq_groups (Sequence[Sequence[ChainSequence]]) – Group of objects to map numbering to.

  • num_proc (int) – A number of processes to use.

  • verbose (bool) – Output a progress bar.

  • kwargs – Passed to lXtractor.util.misc.apply().

Returns:

An iterator over lists of lists with numeric mappings

Return type:

Iterator[list[list[int | None]]]

[[s1_1 map, s1_2 map, ...]
 [s2_1 map, s2_2 map, ...]
           ...
 ]

lXtractor.chain.structure module

class lXtractor.chain.structure.ChainStructure(structure, chain_id=None, structure_id=None, seq=None, parent=None, children=None, variables=None)[source]

Bases: object

A structure of a single chain.

Typical usage workflow:

  1. Use GenericStructure.read() to parse the file.

  2. Split into chains using GenericStructure.split_chains().

  3. Initialize ChainStructure from each chain via from_structure().

s = GenericStructure.read(Path("path/to/structure.cif"))
chain_structures = [
    ChainStructure.from_structure(c) for c in s.split_chains()
]

Two main containers are:

  1. _seq – a ChainSequence of this structure, also containing meta info.

  2. pdb – a container with pdb id, pdb chain id, and the structure itself.

A unique structure is defined by

__init__(structure, chain_id=None, structure_id=None, seq=None, parent=None, children=None, variables=None)[source]
Parameters:
  • structure_id (str | None) – An ID for the structure the chain was taken from.

  • chain_id (str | None) – A chain ID (e.g., “A”, “B”, etc.)

  • structure (GenericStructure | bst.AtomArray | None) – Parsed generic structure with a single chain.

  • seq (ChainSequence | None) – Chain sequence of a structure. If not provided, will use get_sequence.

  • parent (ChainStructure | None) – Specify parental structure.

  • children (abc.Iterable[ChainStructure] | None) – Specify structures descended from this one. This container is used to record sub-structures obtained via spawn_child().

  • variables (Variables | None) – Variables associated with this structure.

Raises:

InitError – If an invalid structure (e.g., a multi-chain one) is provided.

apply_children(fn, inplace=False)[source]

Apply some function to children.

Parameters:
  • fn (ApplyT[ChainStructure]) – A callable accepting and returning the chain structure type instance.

  • inplace (bool) – Apply to children in place. Otherwise, return a copy with only children transformed.

Returns:

A chain structure with transformed children.

Return type:

t.Self

filter_children(pred, inplace=False)[source]

Filter children using some predicate.

Parameters:
  • pred (FilterT[ChainStructure]) – Some callable accepting chain structure and returning bool.

  • inplace (bool) – Filter children in place. Otherwise, return a copy with only children transformed.

Returns:

A chain structure with filtered children.

Return type:

t.Self

iter_children()[source]

Iterate children in topological order.

See ChainSequence.iter_children() and topo_iter().

Return type:

Generator[list[ChainStructure], None, None]

classmethod make_empty()[source]

Create an empty chain structure.

Returns:

An empty chain structure.

Return type:

ChainStructure

classmethod read(base_dir, *, search_children=False, **kwargs)[source]

Read the chain structure from a file disk dump.

Parameters:
  • base_dir (Path) – An existing dir containing structure, structure sequence, meta info, and (optionally) any sub-structure segments.

  • dump_names – File names container.

  • search_children (bool) – Recursively search for sub-segments and populate children.

  • kwargs – Passed to lXtractor.core.structure.GenericStructure.read().

Returns:

An initialized chain structure.

Return type:

t.Self

rm_solvent(copy=False)[source]

Remove solvent “residues” from this structure.

Parameters:

copy (bool) – Copy an atom array that results from solvent removal.

Returns:

A new instance without solvent molecules.

Return type:

t.Self

spawn_child(start, end, name=None, category=None, *, map_from=None, map_closest=True, keep_seq_child=False, keep=True, deep_copy=False, tolerate_failure=False, silent=False)[source]

Create a sub-structure from this one. Start and end have inclusive boundaries.

Parameters:
  • start (int) – Start coordinate.

  • end (int) – End coordinate.

  • name (str | None) – The name of the spawned sub-structure.

  • category (str | None) – Spawned child category. Any meaningful tag string that could be used later to group similar children.

  • map_from (str | None) – Optionally, the map name the boundaries correspond to.

  • map_closest (bool) – Map to closest start, end boundaries (see map_boundaries()).

  • keep_seq_child (bool) – Keep spawned sub-sequence within ChainSequence.children. Beware that it’s best to use a single object type for keeping parent-children relationships to avoid duplicating information.

  • keep (bool) – Keep spawned substructure in children.

  • deep_copy (bool) – Deep copy spawned sub-sequence and sub-structure.

  • tolerate_failure (bool) – Do not raise InitError if the resulting structure subset is empty.

  • silent (bool) – Do not display warnings if tolerate_failure is True.

Returns:

New chain structure – a sub-structure of the current one.

Return type:

ChainStructure

summary(meta=True, children=False, ligands=False)[source]
Return type:

DataFrame

superpose(other, res_id=None, atom_names=None, map_name_self=None, map_name_other=None, mask_self=None, mask_other=None, inplace=False, rmsd_to_meta=True)[source]

Superpose some other structure to this one. It uses biotite.structure.superimpose() internally.

The most important requirement is both structures (after all optional selections applied) having the same number of atoms.

Parameters:
  • other (ChainStructure) – Other chain structure (mobile).

  • res_id (Sequence[int] | None) – Residue positions within this or other chain structure. If None, use all available residues.

  • atom_names (Sequence[Sequence[str]] | Sequence[str] | None) –

    Atom names to use for selected residues. Two options are available:

    1) Sequence of sequences of atom names. In this case, atom names are given per selected residue (res_id), and the external sequence’s length must correspond to the number of residues in the res_id. Note that if no res_id provided, the sequence must encompass all available residues.

    2) A sequence of atom names. In this case, it will be used to select atoms for each available residue. For instance, use atom_names=["CA", "C", "N"] to select backbone atoms.

  • map_name_self (str | None) – Use this map to map res_id to real numbering of this structure.

  • map_name_other (str | None) – Use this map to map res_id to real numbering of the other structure.

  • mask_self (ndarray | None) – Per-atom boolean selection mask to pick fixed atoms within this structure.

  • mask_other (ndarray | None) – Per-atom boolean selection mask to pick mobile atoms within the other structure. Note that mask_self and mask_other take precedence over other selection specifications.

  • inplace (bool) – Apply the transformation to the mobile structure inplace, mutating other. Otherwise, make a new instance: same as other, but with transformed atomic coordinates of a pdb.structure.

  • rmsd_to_meta (bool) – Write RMSD to the meta of other as “rmsd

Returns:

A tuple with (1) transformed chain structure, (2) transformation RMSD, and (3) transformation matrices (see biotite.structure.superimpose() for details).

Return type:

tuple[ChainStructure, float, tuple[ndarray, ndarray, ndarray]]

write(dest, fmt='mmtf.gz', *, write_children=False)[source]

Write this object into a directory. It will create the following files:

  1. meta.tsv

  2. sequence.tsv

  3. structure.fmt

Existing files will be overwritten.

Parameters:
  • dest (Path) – A writable dir to save files to.

  • fmt (str) – Structure format to use. Supported formats are “pdb”, “cif”, and “mmtf”. Adding “.gz” (e.g., “mmtf.gz”) will lead to gzip compression.

  • write_children (bool) – Recursively write children.

Returns:

Path to the directory where the files are written.

Return type:

Path

property altloc: str
Returns:

An altloc ID.

property array: AtomArray
Returns:

The AtomArray object (a shortcut for .pdb.structure.array).

property categories: list[str]
Returns:

A list of categories encapsulated within ChainSequence.meta.

property chain_id: str
children: ChainList[ChainStructure]

Any sub-structures descended from this one, preferably using spawn_child().

property end: int
Returns:

Structure sequence’s end

property id: str
Returns:

ChainStructure identifier in the format “ChainStructure({_seq.id}|{alt_locs})<-(parent.id)”.

property is_empty: bool
Returns:

True if the structure is empty and False otherwise.

property ligands: tuple[Ligand, ...]
Returns:

A tuple of connected ligands.

property meta: dict[str, str]
Returns:

Meta info of a _seq.

property name: str | None
Returns:

Structure sequence’s name

property parent: t.Self | None
property seq: ChainSequence
property start: int
Returns:

Structure sequence’s start

property structure: GenericStructure
variables: Variables

Variables assigned to this structure. Each should be an instance of lXtractor.variables.base.StructureVariable.

lXtractor.chain.structure.filter_selection_extended(c, pos=None, atom_names=None, map_name=None, exclude_hydrogen=False, tolerate_missing=False)[source]

Get mask for certain positions and atoms of a chain structure.

Parameters:
  • c (ChainStructure) – Arbitrary chain structure.

  • pos (Sequence[int] | None) – A sequence of positions.

  • atom_names (Sequence[Sequence[str]] | Sequence[str] | None) – A sequence of atom names (broadcast to each position in pos) or an iterable over such sequences for each position in pos.

  • map_name (str | None) – A map name to map from pos to numbering.

  • exclude_hydrogen (bool) – For convenience, exclude hydrogen atoms. Especially useful during pre-processing for superposition.

  • tolerate_missing (bool) – If True, do not raise an error when certain positions fail to map.

Returns:

A binary mask, True for selected atoms.

Return type:

ndarray
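
The two accepted forms of atom_names (one list broadcast to every position, or one list per position) can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: atoms are modeled as (residue_position, atom_name) tuples, the helper selection_mask() is hypothetical, and mapping via map_name, hydrogen exclusion, and missing-position handling are omitted.

```python
def selection_mask(atoms, pos=None, atom_names=None):
    """Return a list of booleans, True for selected atoms.

    atoms: list of (residue_position, atom_name) tuples (a toy model).
    pos: positions to select; None means all positions present in atoms.
    atom_names: None, a flat list of names, or one list of names per position.
    """
    if pos is None:
        # Unique positions in order of first appearance.
        pos = list(dict.fromkeys(res for res, _ in atoms))
    if atom_names is None:
        allowed = {p: None for p in pos}  # None means "any atom name"
    elif all(isinstance(n, str) for n in atom_names):
        # Flat form: the same names are broadcast to every position.
        allowed = {p: set(atom_names) for p in pos}
    else:
        # Nested form: one group of names per selected position.
        if len(atom_names) != len(pos):
            raise ValueError("Need exactly one atom name group per position")
        allowed = {p: set(names) for p, names in zip(pos, atom_names)}
    return [
        res in allowed and (allowed[res] is None or name in allowed[res])
        for res, name in atoms
    ]


atoms = [(1, "N"), (1, "CA"), (1, "CB"), (2, "N"), (2, "CA"), (3, "CA")]
broadcast = selection_mask(atoms, pos=[1, 2], atom_names=["CA"])
per_position = selection_mask(atoms, pos=[1, 2], atom_names=[["N", "CA"], ["CA"]])
everything = selection_mask(atoms)
```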

lXtractor.chain.structure.subset_to_matching(reference, c, map_name=None, skip_if_match='seq1', **kwargs)[source]

Subset both chain structures to aligned residues using sequence alignment.

Note

It’s not necessary, but it makes sense for reference and c to be somehow related.

Parameters:
  • reference (ChainStructure) – A chain structure to align to.

  • c (ChainStructure) – A chain structure to align.

  • map_name (str | None) – If provided, c is considered “pre-aligned” to the reference, and reference possesses the numbering under map_name.

  • skip_if_match (str) –

    Two options:

    1. Sequence/Map name, e.g., “seq1” – if sequences under this name match exactly, skip alignment and return original chain structures.

    2. “len” – if sequences have equal length, skip alignment and return original chain structures.

Returns:

A pair of new structures having the same number of residues that were successfully matched during the alignment.

Return type:

tuple[ChainStructure, ChainStructure]
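
The two skip_if_match options amount to a cheap check performed before any alignment. The sketch below shows that decision in isolation. It assumes each structure's sequences are available as a plain dict {name: sequence}; the helper should_skip_alignment() is hypothetical, and comparing the “seq1” entries for the “len” option is also an assumption.

```python
def should_skip_alignment(ref_seqs, other_seqs, skip_if_match="seq1"):
    """Decide whether alignment can be skipped (illustrative only).

    ref_seqs / other_seqs: dicts mapping sequence or map names to sequences.
    """
    if skip_if_match == "len":
        # Option 2: equal length is enough (assumed to compare "seq1").
        return len(ref_seqs["seq1"]) == len(other_seqs["seq1"])
    # Option 1: sequences stored under the given name must match exactly.
    return ref_seqs[skip_if_match] == other_seqs[skip_if_match]


identical = should_skip_alignment({"seq1": "ABCD"}, {"seq1": "ABCD"})
mutated = should_skip_alignment({"seq1": "ABCD"}, {"seq1": "ABCE"})
mutated_len = should_skip_alignment({"seq1": "ABCD"}, {"seq1": "ABCE"}, "len")
```

Note that the “len” option accepts sequences that differ in content, so it is the more permissive of the two.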

lXtractor.chain.chain module

class lXtractor.chain.chain.Chain(seq, structures=None, parent=None, children=None)[source]

Bases: object

A container, encompassing a ChainSequence and possibly many ChainStructure’s corresponding to a single protein chain.

A typical use case is when one wants to benefit from the connection of structural and sequential data, e.g., using a single full canonical sequence as _seq and all the associated structures within structures. In this case, this data structure makes it easier to extract, annotate, and calculate variables using the canonical sequence mapped to the sequence of a structure.

Typical workflow:

  1. Initialize from some canonical sequence.

  2. Add structures and map their sequences.

  3. ???

  4. Do something useful, like calculate variables using canonical sequence’s positions.

c = Chain.from_sequence((header, _seq))
for s in structures:
    c.add_structure(s)
__init__(seq, structures=None, parent=None, children=None)[source]
Parameters:
  • seq (ChainSequence) – A chain sequence.

  • structures (Iterable[ChainStructure] | None) – Chain structures corresponding to a single protein chain specified by _seq.

  • parent (Chain | None) – A parent chain this chain has descended from.

  • children (Iterable[Chain] | None) – A collection of children.

add_structure(structure, *, check_ids=True, map_to_seq=True, map_name='map_canonical', add_to_children=False, **kwargs)[source]

Add a structure to structures.

Parameters:
  • structure (ChainStructure) – A structure of a single chain corresponding to _seq.

  • check_ids (bool) – Check that existing structures don’t encompass the structure with the same id().

  • map_to_seq (bool) – Align the structure sequence to the _seq and create a mapping within the former.

  • map_name (str) – If map_to_seq is True, use this map name.

  • add_to_children (bool) – If True, will recursively add structure to existing children according to their boundaries mapped to the structure’s numbering. Consequently, this requires mapping, i.e., map_to_seq=True.

  • kwargs – Passed to ChainSequence.map_numbering().

Returns:

Mutates structures and returns nothing.

Raises:

ValueError – If check_ids is True and the structure id clashes with the existing ones.

apply_children(fn, inplace=False)[source]

Apply some function to children.

Parameters:
  • fn (ApplyT[Chain]) – A callable accepting and returning the chain type instance.

  • inplace (bool) – Apply to children in place. Otherwise, return a copy with only children transformed.

Returns:

A chain with transformed children.

Return type:

t.Self

apply_structures(fn, inplace=False)[source]

Apply some function to structures.

Parameters:
  • fn (ApplyT[ChainStructure]) – A callable accepting and returning a chain structure.

  • inplace (bool) – Apply to structures in place. Otherwise, return a copy with only structures transformed.

Returns:

A chain with transformed structures.

Return type:

t.Self

filter_children(pred, inplace=False)[source]

Filter children using some predicate.

Parameters:
  • pred (FilterT[Chain]) – Some callable accepting chain and returning bool.

  • inplace (bool) – Filter children in place. Otherwise, return a copy with only children filtered.

Returns:

A chain with filtered children.

Return type:

t.Self

filter_structures(pred, inplace=False)[source]

Filter chain structures.

Parameters:
  • pred (FilterT[ChainStructure]) – A callable accepting a chain structure and returning bool.

  • inplace (bool) – Filter structures in place. Otherwise, return a copy with only structures filtered.

Returns:

A chain with filtered structures.

Return type:

t.Self
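
The inplace flag shared by apply_children(), apply_structures(), filter_children(), and filter_structures() follows a common pattern: either mutate the object or return a modified shallow copy. The toy class below sketches that pattern. ToyChain is hypothetical and stands in for Chain; returning self in the in-place case is an assumption made for the sketch.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyChain:
    """A stand-in for Chain holding a name and a list of 'structures'."""

    name: str
    structures: list = field(default_factory=list)

    def filter_structures(self, pred, inplace=False):
        kept = [s for s in self.structures if pred(s)]
        if inplace:
            # Mutate this object and hand it back.
            self.structures = kept
            return self
        # Shallow copy: only the structures attribute is replaced.
        new = copy.copy(self)
        new.structures = kept
        return new


chain = ToyChain("C", ["a", "bb", "ccc"])
filtered = chain.filter_structures(lambda s: len(s) > 1)
untouched = list(chain.structures)
same_obj = chain.filter_structures(lambda s: len(s) > 1, inplace=True)
```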

generate_patched_seqs(numbering='numbering', link_name='map_canonical', link_points_to='i', **kwargs)[source]

Generate patched sequences from chain structure sequences.

For explanation of the patching process see lXtractor.chain.sequence.ChainSequence.patch().

Parameters:
  • numbering (str) – Map name referring to a numbering scheme to infer gaps from.

  • link_name (str) – Map name linking structure sequence to the canonical sequence.

  • link_points_to (str) – Map name in the canonical sequence that link_name refers to.

  • kwargs – Passed to lXtractor.chain.sequence.ChainSequence.patch().

Returns:

A generator over patched structure sequences.

Return type:

Generator[ChainSequence, None, None]

iter_children()[source]

Iterate children in topological order.

See ChainSequence.iter_children() and topo_iter().

Returns:

Iterator over levels of a child tree.

Return type:

Generator[list[Chain], None, None]
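
Iteration in topological order means yielding one list per level of the child tree. The generator below is a minimal sketch of that idea, consistent with the topo_iter() doctest at the top of this page but not copied from the library. The name topo_levels and the toy tree dict are assumptions for illustration.

```python
def topo_levels(start_obj, iterator):
    """Yield lists of objects, one list per level below start_obj."""
    level = list(iterator(start_obj))
    while level:
        yield level
        # The next level is every child of every object in the current level.
        level = [child for obj in level for child in iterator(obj)]


# A toy tree: each key maps to the names of its children.
tree = {"root": ["a", "b"], "a": ["a1"], "b": [], "a1": []}
levels = list(topo_levels("root", lambda name: tree[name]))

# The same numeric example as in the topo_iter() doctest.
numeric = topo_levels(1, lambda x: (x + 1 for _ in range(x)))
first, second = next(numeric), next(numeric)
```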

classmethod make_empty()[source]
Return type:

t.Self

classmethod read(path, *, search_children=False)[source]
Parameters:
  • path (Path) – A path to a directory with at least sequence and metadata files.

  • search_children (bool) – Recursively search for child segments and populate children.

Returns:

An initialized chain.

Return type:

Chain

spawn_child(start, end, name=None, category=None, *, subset_structures=True, tolerate_failure=False, silent=False, keep=True, seq_deep_copy=False, seq_map_from=None, seq_map_closest=True, seq_keep_child=False, str_deep_copy=False, str_map_from=None, str_map_closest=True, str_keep_child=True, str_seq_keep_child=False, str_min_size=1, str_accept_fn=<function Chain.<lambda>>)[source]

Subset a _seq and (optionally) each structure in structures using the provided _seq boundaries (inclusive).

Parameters:
  • start (int) – Start coordinate.

  • end (int) – End coordinate.

  • name (str | None) – Name of a new chain.

  • category (str | None) – Spawned child category. Any meaningful tag string that could be used later to group similar children.

  • subset_structures (bool) – If True, subset each structure in structures. If False, structures are not inherited.

  • tolerate_failure (bool) – If True, a failure to subset a structure doesn’t raise an error.

  • silent (bool) – Suppress warnings for errors when tolerate_failure is True.

  • keep (bool) – Save created child to children.

  • seq_deep_copy (bool) – Deep copy potentially mutable sequences within _seq.

  • seq_map_from (str | None) – Use this map to obtain coordinates within _seq.

  • seq_map_closest (bool) – Map to the closest matching coordinates of a _seq. See ChainSequence.map_boundaries() and ChainSequence.find_closest().

  • seq_keep_child (bool) – Keep a spawned ChainSequence as a child within _seq. Should be False if keep is True to avoid data duplication.

  • str_deep_copy (bool) – Deep copy each sub-structure.

  • str_map_from (str | None) – Use this map to obtain coordinates within ChainStructure._seq of each structure.

  • str_map_closest (bool) – Map to the closest matching coordinates of a _seq. See ChainSequence.map_boundaries() and ChainSequence.find_closest().

  • str_keep_child (bool) – Keep a spawned sub-structure as a child in ChainStructure.children. Should be False if keep is True to avoid data duplication.

  • str_seq_keep_child (bool) – Keep a sub-sequence of a spawned structure within the ChainSequence.children of ChainStructure._seq of a spawned structure. Should be False if keep or str_keep_child is True to avoid data duplication.

  • str_min_size (int | float) – A minimum number of residues in a structure to be accepted after subsetting.

  • str_accept_fn (abc.Callable[[ChainStructure], bool]) – A filter function accepting a ChainStructure and returning a boolean value indicating whether this structure should be retained in structures.

Returns:

A sub-chain with sub-sequence and (optionally) sub-structures.

Return type:

t.Self
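
Because the boundaries are inclusive and refer to the sequence's own numbering rather than to zero-based indices, converting them to a Python slice requires an offset. The sketch below shows the arithmetic on a plain string. The helper subset_inclusive() is hypothetical, and mapping options such as seq_map_from or seq_map_closest are not modeled.

```python
def subset_inclusive(seq, seq_start, start, end):
    """Return the part of seq between start and end, both inclusive.

    seq_start is the numbering of the first character of seq.
    """
    seq_end = seq_start + len(seq) - 1
    if not (seq_start <= start <= end <= seq_end):
        raise ValueError(
            f"Boundaries {start}-{end} fall outside {seq_start}-{seq_end}"
        )
    # Shift to zero-based indices; add 1 because slices exclude the stop.
    return seq[start - seq_start : end - seq_start + 1]


first_four = subset_inclusive("ABCDE", 1, 1, 4)
middle = subset_inclusive("ABCDE", 1, 2, 3)
offset = subset_inclusive("ABCDE", 10, 11, 12)
```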

summary(meta=True, children=False, structures=True)[source]
Return type:

DataFrame

transfer_seq_mapping(map_name, link_map='map_canonical', link_map_points_to='i', **kwargs)[source]

Transfer sequence mapping to each ChainStructure._seq within structures.

This method simply utilizes ChainSequence.relate() to transfer some map from the _seq to each ChainStructure._seq. Check ChainSequence.relate() for an explanation.

Parameters:
  • map_name (str) – The name of the map to transfer.

  • link_map (str) – A name of the map existing within ChainStructure._seq of each structure in structures.

  • link_map_points_to (str) – The name of the sequence in _seq that the values of link_map point to.

  • kwargs – Passed to ChainSequence.relate().

Returns:

Nothing.

write(dest, *, str_fmt='mmtf.gz', write_children=True)[source]

Create a disk dump of this chain data. Created dumps can be reinitialized via read().

Parameters:
  • dest (Path) – A writable dir to hold the data.

  • str_fmt (str) – A format to write structures in.

  • write_children (bool) – Recursively write children.

Returns:

Path to the directory where the files are written.

Return type:

Path

property categories: list[str]
Returns:

A list of categories from _seq’s ChainSequence.meta.

children: ChainList[Chain]

A collection of children preferably obtained using spawn_child().

property end: int
Returns:

Structure sequence’s end

property id: str
Returns:

Chain identifier derived from its _seq ID.

property meta: dict[str, str]
Returns:

A seq()’s ChainSequence.meta.

property name: str | None
Returns:

Structure sequence’s name

property parent: t.Self | None
property seq: ChainSequence
property start: int
Returns:

Structure sequence’s start

structures: ChainList[ChainStructure]

lXtractor.chain.list module

The module defines the ChainList - a list of Chain*-type objects that behaves like a regular list but has additional bells and whistles tailored towards Chain* data structures.

class lXtractor.chain.list.ChainList(chains, categories=None)[source]

Bases: MutableSequence[CT]

A mutable single-type collection holding either Chain’s, or ChainSequence’s, or ChainStructure’s.

Object’s functionality relies on this type purity. Adding or concatenating objects of a different type shall raise an error.

It behaves like a regular list with additional functionality.

>>> from lXtractor.chain import ChainSequence
>>> s = ChainSequence.from_string('SEQUENCE', name='S')
>>> x = ChainSequence.from_string('XXX', name='X')
>>> x.meta['category'] = 'x'
>>> cl = ChainList([s, s, x])
>>> cl
[S|1-8, S|1-8, X|1-3]
>>> cl[0]
S|1-8
>>> cl['S']
[S|1-8, S|1-8]
>>> cl[:2]
[S|1-8, S|1-8]
>>> cl['1-3']
[X|1-3]

Adding/appending/removing objects of a similar type is easy and works similarly to a regular list.

>>> cl += [s]
>>> assert len(cl) == 4
>>> cl.remove(s)
>>> assert len(cl) == 3

Categories can be accessed as attributes or using [] syntax (similar to pandas.DataFrame columns).

>>> cl.x
[X|1-3]
>>> cl['x']
[X|1-3]

When creating a chain list, the categories parameter assigns categories to sequences. Note that such operations return a new ChainList object.

>>> cl = ChainList([s, x], categories=['S', ['X1', 'X2']])
>>> cl.S
[S|1-8]
>>> cl.X2
[X|1-3]
>>> cl['X1']
[X|1-3]
__init__(chains, categories=None)[source]
Parameters:
  • chains (Iterable[CT]) – An iterable over Chain*-type objects.

  • categories (Iterable[str | Iterable[str]] | None) – An optional list of categories. If provided, they will be assigned to inputs’ meta attributes.

apply(fn, verbose=False, desc='Applying to objects', num_proc=1)[source]

Apply a function to each object and return a new chain list of results.

Parameters:
  • fn (ApplyT) – A callable to apply.

  • verbose (bool) – Display progress bar.

  • desc (str) – Progress bar description.

  • num_proc (int) – The number of CPUs to use. num_proc <= 1 indicates sequential processing.

Returns:

A new chain list with application results.

Return type:

ChainList[CT]

collapse()[source]

Collapse all objects and their children within this list into a new chain list. This is a shortcut for chain_list + chain_list.collapse_children().

Returns:

Collapsed list.

Return type:

ChainList[CT]

collapse_children()[source]

Collapse all children of each object in this list into a single chain list.

>>> from lXtractor.chain import ChainSequence
>>> s = ChainSequence.from_string('ABCDE', name='A')
>>> child1 = s.spawn_child(1, 4)
>>> child2 = child1.spawn_child(2, 3)
>>> cl = ChainList([s]).collapse_children()
>>> assert isinstance(cl, ChainList)
>>> cl
[A|1-4<-(A|1-5), A|2-3<-(A|1-4<-(A|1-5))]
Returns:

A chain list of all children.

Return type:

ChainList[CT]

drop_duplicates(key=<function ChainList.<lambda>>)[source]
Parameters:

key (abc.Callable[[CT], t.Hashable] | None) – A callable accepting the single element and returning some hashable object associated with that element.

Returns:

A new list with unique elements as judged by the key.

Return type:

t.Self
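
Key-based de-duplication can be sketched on a plain list. The function below is an illustration, not the library's code: it assumes that the first element seen for each key is the one kept and that the original order is preserved, neither of which is stated above.

```python
def drop_duplicates(items, key=lambda x: x):
    """Return a new list with one element per distinct key value."""
    seen = set()
    unique = []
    for item in items:
        k = key(item)  # must be hashable, as required of the key callable
        if k not in seen:
            seen.add(k)
            unique.append(item)
    return unique


by_identity = drop_duplicates(["S|1-8", "S|1-8", "X|1-3"])
by_length = drop_duplicates(["ab", "cd", "e"], key=len)
```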

filter(pred)[source]
>>> from lXtractor.chain import ChainSequence
>>> cl = ChainList(
...     [ChainSequence.from_string('AAAX', name='A'),
...      ChainSequence.from_string('XXX', name='X')]
... )
>>> cl.filter(lambda c: c.seq1[0] == 'A')
[A|1-4]
Parameters:

pred (Callable[[CT], bool]) – Predicate callable for filtering.

Returns:

A filtered chain list (new object).

Return type:

ChainList[CT]

filter_category(name)[source]
Parameters:

name (str) – Category name.

Returns:

Filtered objects having this category within their meta["category"].

Return type:

ChainList

filter_pos(s, *, match_type='overlap', map_name=None)[source]

Filter to objects encompassing certain consecutive position regions or arbitrary collections of positions.

For Chain and ChainStructure, the filtering is over _seq attributes.

Parameters:
  • s (lxs.Segment | abc.Collection[Ord]) –

    What to search for:

    1. s=Segment(start, end) to find all objects encompassing certain region.

    2. [pos1, posX, posN] to find all objects encompassing the specified positions.

  • match_type (str) –

    If s is Segment, this value determines the acceptable relationships between s and each ChainSequence:

    1. ”overlap” – it’s enough to overlap with s.

    2. ”bounding” – object is accepted if it bounds s.

    3. ”bounded” – object is accepted if it’s bounded by s.

  • map_name (str | None) –

    Use this map to map positions of s. For instance, to search for all elements encompassing region 1-5 of a canonical sequence, one would use

    chain_list.filter_pos(
        s=Segment(1, 5), match_type="bounding",
        map_name="map_canonical"
    )
    

Returns:

A list of hits of the same type.

Return type:

ChainList[CS]
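
The three match_type values describe interval relationships between an object and the query segment s. The sketch below spells them out for closed integer intervals given as (start, end) tuples. The helper matches() is hypothetical, and treating the boundaries as inclusive is an assumption.

```python
def matches(obj, s, match_type="overlap"):
    """Check how the interval obj relates to the query interval s.

    Both arguments are (start, end) tuples with inclusive boundaries.
    """
    (o_start, o_end), (s_start, s_end) = obj, s
    if match_type == "overlap":
        # At least one shared position.
        return o_start <= s_end and s_start <= o_end
    if match_type == "bounding":
        # The object fully contains s.
        return o_start <= s_start and s_end <= o_end
    if match_type == "bounded":
        # The object lies fully inside s.
        return s_start <= o_start and o_end <= s_end
    raise ValueError(f"Unknown match_type: {match_type!r}")


query = (1, 5)
results = {
    "long_bounding": matches((1, 8), query, "bounding"),
    "short_bounding": matches((1, 3), query, "bounding"),
    "short_bounded": matches((1, 3), query, "bounded"),
    "partial_overlap": matches((4, 10), query, "overlap"),
    "disjoint_overlap": matches((6, 10), query, "overlap"),
}
```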

get_level(n)[source]

Get a specific level of a hierarchical tree starting from this list:

l0: this list
l1: children of each child of each object in l0
l2: children of each child of each object in l1
...
Parameters:

n (int) – The level index (0 indicates this list). Other levels are obtained via iter_children().

Returns:

A chain list of objects corresponding to a specific topological level of a child tree.

Return type:

ChainList[CT]

groupby(key)[source]

Group sequences in this list by a given key.

Parameters:

key (abc.Callable[[CT], T]) – Some callable accepting a single chain and returning a grouper value.

Returns:

An iterator over pairs (group, chains), where chains is a chain list of chains that belong to group.

Return type:

abc.Iterator[tuple[T, t.Self]]
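
Grouping by a key callable can be sketched with itertools.groupby from the standard library. This is an illustration on plain strings under the assumption that elements are sorted by the key first, which requires orderable key values; whether ChainList.groupby() sorts internally is not stated above. The name group_by_key is hypothetical.

```python
from itertools import groupby


def group_by_key(items, key):
    """Yield (group, members) pairs, one per distinct key value."""
    # itertools.groupby only merges adjacent items, hence the sort.
    for group, members in groupby(sorted(items, key=key), key=key):
        yield group, list(members)


groups = dict(group_by_key(["AAAX", "XXX", "ABC"], key=lambda s: s[0]))
```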

index(value[, start[, stop]]) -> integer -- return first index of value.[source]

Raises ValueError if the value is not present.

Supporting start and stop arguments is optional, but recommended.

Return type:

int

insert(index, value)[source]

S.insert(index, value) – insert value before index

iter_children()[source]

Simultaneously iterate over topological levels of children.

>>> from lXtractor.chain import ChainSequence
>>> s = ChainSequence.from_string('ABCDE', name='A')
>>> child1 = s.spawn_child(1, 4)
>>> child2 = child1.spawn_child(2, 3)
>>> x = ChainSequence.from_string('XXXX', name='X')
>>> child3 = x.spawn_child(1, 3)
>>> cl = ChainList([s, x])
>>> list(cl.iter_children())
[[A|1-4<-(A|1-5), X|1-3<-(X|1-4)], [A|2-3<-(A|1-4<-(A|1-5))]]
Returns:

An iterator over chain lists of children levels.

Return type:

Generator[ChainList[CT], None, None]

iter_ids()[source]

Iterate over ids of this chain list.

Returns:

An iterator over chain ids.

Return type:

Iterator[str]

iter_sequences()[source]
Returns:

An iterator over ChainSequence’s.

Return type:

abc.Generator[ChainSequence, None, None]

iter_structure_sequences()[source]
Returns:

An iterator over ChainStructure._seq attributes.

Return type:

abc.Generator[ChainSequence, None, None]

iter_structures()[source]
Returns:

A generator over ChainStructure’s.

Return type:

abc.Generator[ChainStructure, None, None]

sort(key=<function ChainList.<lambda>>)[source]
Return type:

ChainList[CT]

summary(**kwargs)[source]
Return type:

DataFrame

property categories: Set[str]
Returns:

A set of categories inferred from meta of encompassed objects.

property ids: list[str]
Returns:

A list of ids for all chains in this list.

property sequences: ChainList[ChainSequence]
Returns:

Get all lXtractor.core.chain.Chain._seq or lXtractor.core.chain.sequence.ChainSequence objects within this chain list.

property structure_sequences: ChainList[ChainSequence]
property structures: ChainList[ChainStructure]
lXtractor.chain.list.add_category(c, cat)[source]
Parameters:
  • c (Any) – A Chain*-type object.

  • cat (str) – Category name.

Returns:

lXtractor.chain.io module

class lXtractor.chain.io.ChainIO(num_proc=1, verbose=False, tolerate_failures=False)[source]

Bases: object

A class handling reading/writing collections of Chain* objects.

__init__(num_proc=1, verbose=False, tolerate_failures=False)[source]
Parameters:
  • num_proc (int) – The number of parallel processes. Using more processes is especially beneficial for ChainStructure’s and Chain’s with structures. Otherwise, increasing this number may not reduce, or may even increase, the time needed to read/write objects.

  • verbose (bool) – Output logging and progress bar.

  • tolerate_failures (bool) – Errors when reading/writing do not raise an exception.

read(obj_type, path, callbacks=(), **kwargs)[source]

Read obj_type-type objects from a path or an iterable of paths.

Parameters:
  • obj_type (Type[CT]) – Some class with a read(path) classmethod.

  • path (Path | Iterable[Path]) – Path to the dump to read from. It’s a path to directory holding files necessary to init a given obj_type, or an iterable over such paths.

  • callbacks (Sequence[Callable[[CT], CT]]) – Callables applied sequentially to parsed object.

  • kwargs – Passed to the object’s read() method.

Returns:

A generator over initialized objects or futures.

Return type:

Generator[CT | None, None, None]

read_chain(path, **kwargs)[source]

Read Chain’s from the provided path.

If path contains signature files and directories (such as sequence.tsv and segments), it is assumed to contain a single object. Otherwise, it is assumed to contain multiple Chain objects.

Parameters:
  • path (Path | Iterable[Path]) – Path to a dump or a dir of dumps.

  • kwargs – Passed to read().

Returns:

An iterator over Chain objects.

Return type:

Generator[Chain | None, None, None]

read_chain_seq(path, **kwargs)[source]

Read ChainSequence’s from the provided path.

If path contains signature files and directories (such as sequence.tsv and segments), it is assumed to contain a single object. Otherwise, it is assumed to contain multiple ChainSequence objects.

Parameters:
  • path (Path | Iterable[Path]) – Path to a dump or a dir of dumps.

  • kwargs – Passed to read().

Returns:

An iterator over ChainSequence objects.

Return type:

Generator[ChainSequence | None, None, None]

read_chain_str(path, **kwargs)[source]

Read ChainStructure’s from the provided path.

If path contains signature files and directories (such as structure.cif and segments), it is assumed to contain a single object. Otherwise, it is assumed to contain multiple ChainStructure objects.

Parameters:
  • path (Path | Iterable[Path]) – Path to a dump or a dir of dumps.

  • kwargs – Passed to read().

Returns:

An iterator over ChainStructure objects.

Return type:

Generator[ChainStructure | None, None, None]

write(chains, base, overwrite=False, **kwargs)[source]
Parameters:
  • chains (CT | Iterable[CT]) – A single or multiple chains to write.

  • base (Path) – A writable dir. For multiple chains, will use base/chain.id directory.

  • overwrite (bool) – If the destination folder exists, False means returning the destination path without attempting to write the chain, whereas True results in an explicit .write() call.

  • kwargs – Passed to a chain’s write method.

Returns:

Whatever write method returns.

Return type:

Generator[Path | None | Future, None, None]
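
The overwrite behavior described above can be sketched with pathlib alone. The helper below is hypothetical and is not ChainIO's code: it returns the destination untouched when the directory already exists and overwrite is False, and otherwise delegates to a caller-supplied write_fn (an assumed stand-in for a chain's write method).

```python
import tempfile
from pathlib import Path


def write_if_needed(dest, write_fn, overwrite=False):
    """Call write_fn(dest) unless dest exists and overwrite is False."""
    dest = Path(dest)
    if dest.exists() and not overwrite:
        # Existing dump: report the path without writing anything.
        return dest
    dest.mkdir(parents=True, exist_ok=True)
    write_fn(dest)
    return dest


calls = []  # records every destination that was actually written
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "chain_id"
    write_if_needed(target, calls.append)                  # written
    write_if_needed(target, calls.append)                  # skipped
    write_if_needed(target, calls.append, overwrite=True)  # written again
```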

num_proc

The number of parallel processes

tolerate_failures

Errors when reading/writing do not raise an exception.

verbose

Output logging and progress bar.

class lXtractor.chain.io.ChainIOConfig(num_proc: 'int' = 1, verbose: 'bool' = False, tolerate_failures: 'bool' = False)[source]

Bases: object

__init__(num_proc=1, verbose=False, tolerate_failures=False)
num_proc: int = 1
tolerate_failures: bool = False
verbose: bool = False
lXtractor.chain.io.read_chains(paths, children, *, seq_cfg=ChainIOConfig(num_proc=1, verbose=False, tolerate_failures=False), str_cfg=ChainIOConfig(num_proc=1, verbose=False, tolerate_failures=False), seq_callbacks=(), str_callbacks=(), seq_kwargs=None, str_kwargs=None)[source]

Reads saved lXtractor.core.chain.chain.Chain objects without invoking lXtractor.core.chain.chain.Chain.read(). Instead, it will use separate ChainIO instances to read chain sequences and chain structures. The output is identical to ChainIO.read_chain_seq().

Consider using it for:

  1. Parallel parsing of Chain objects with many structures.

  2. Separate treatment of chain sequences and chain structures.

  3. Better customization of chain sequences and structures parsing.

Parameters:
  • paths (Path | Sequence[Path]) – A path or a sequence of paths to chains.

  • children (bool) – Search for, parse and integrate all nested children.

  • seq_cfg (ChainIOConfig) – ChainIO config for chain sequences parsing.

  • str_cfg (ChainIOConfig) – … for chain structures parsing.

  • seq_callbacks (Sequence[Callable[[CT], CT]]) – A (potentially empty) sequence passed to the reader. Each callback must accept and return a single chain sequence.

  • str_callbacks (Sequence[Callable[[CT], CT]]) – … Same for the structures.

  • seq_kwargs (dict[str, Any] | None) – Passed to lXtractor.core.chain.sequence.ChainSequence.read().

  • str_kwargs (dict[str, Any] | None) – Passed to lXtractor.core.chain.structure.ChainStructure.read().

Returns:

A chain list of parsed chains.

Return type:

ChainList[Chain]

lXtractor.chain.initializer module

A module encompassing the ChainInitializer used to init Chain*-type objects from various input types. It enables parallelization of reading structures and seq2seq mappings and is flexible thanks to callbacks.

class lXtractor.chain.initializer.ChainInitializer(tolerate_failures=False, verbose=False)[source]

Bases: object

In contrast to ChainIO, this object initializes new Chain, ChainStructure, or ChainSequence objects from various input types.

To initialize Chain objects, use from_mapping().

To initialize ChainSequence or ChainStructure objects, use from_iterable().

__init__(tolerate_failures=False, verbose=False)[source]
Parameters:
  • tolerate_failures (bool) – Don’t stop the execution if some object fails to initialize.

  • verbose (bool) – Output progress bars.

from_iterable(it, num_proc=1, callbacks=None, desc='Initializing objects')[source]

Initialize ChainSequence’s and/or ChainStructure’s from a (possibly heterogeneous) iterable.

Parameters:
  • it (abc.Iterable[ChainSequence | ChainStructure | Path | tuple[Path, abc.Sequence[str]] | tuple[str, str] | GenericStructure]) –

    Supported elements are:
    1. Initialized objects (passed without any actions).

    2. Path to a sequence or a structure file.

    3. (Path to a structure file, list of target chains).

    4. A pair (header, _seq) to initialize a ChainSequence.

    5. A GenericStructure with a single chain.

  • num_proc (int) – The number of processes to use.

  • callbacks (abc.Sequence[SingletonCallback] | None) – A sequence of callables accepting and returning an initialized object.

  • desc (str) – Progress bar description used if verbose is True.

Returns:

A generator yielding initialized chain sequences and structures parsed from the inputs.

Return type:

abc.Generator[_O | Future, None, None]

from_mapping(m, key_callbacks=None, val_callbacks=None, item_callbacks=None, *, map_numberings=True, num_proc_read_seq=1, num_proc_read_str=1, num_proc_item_callbacks=1, num_proc_map_numbering=1, num_proc_add_structure=1, **kwargs)[source]

Initialize Chain’s from mapping between sequences and structures.

It will first initialize objects to which the elements of m refer (see below) and then create maps between each sequence and associated structures, saving these into structure ChainStructure._seq’s.

Note

key_callbacks and val_callbacks are distributed to the parser and applied right after parsing the object. As a result, their application will be parallelized depending on the num_proc_read_seq and num_proc_read_str parameters.

Parameters:
  • m (abc.Mapping[ChainSequence | Chain | tuple[str, str] | Path, abc.Sequence[ChainStructure | GenericStructure | bst.AtomArray | Path | tuple[Path, abc.Sequence[str]]]]) –

    A mapping of the form {_seq => [structures]}, where _seq is one of:

    1. Initialized ChainSequence.

    2. A pair (header, _seq).

    3. A path to a fasta file containing a single sequence.

    While each structure is one of:

    1. Initialized ChainStructure.

    2. GenericStructure with a single chain.

    3. biotite.AtomArray corresponding to a single chain.

    4. A path to a structure file.

    5. (A path to a structure file, list of target chains).

    In the latter two cases, the chains will be expanded and associated with the same sequence.

  • key_callbacks (abc.Sequence[SingletonCallback] | None) – A sequence of callables accepting and returning a ChainSequence.

  • val_callbacks (abc.Sequence[SingletonCallback] | None) – A sequence of callables accepting and returning a ChainStructure.

  • item_callbacks (abc.Sequence[ItemCallback] | None) – A sequence of callables accepting and returning a parsed item – a tuple of Chain and a sequence of associated ChainStructure’s. Callbacks are applied sequentially to each item as a function composition in the supplied order (left to right). If the last callback returns None as the first element or an empty list as the second element, such an item will be filtered out. Item callbacks are applied after parsing sequences and structures and converting chain sequences to chains.

  • map_numberings (bool) – Map PDB numberings to canonical sequence’s numbering via pairwise sequence alignments.

  • num_proc_read_seq (int) – A number of processes to devote to sequence parsing. Typically, sequence reading doesn’t benefit from parallel processing, so it’s better to leave this default.

  • num_proc_read_str (int) – A number of processes dedicated to structures parsing.

  • num_proc_item_callbacks (int) – A number of CPUs to parallelize item callbacks’ application.

  • num_proc_map_numbering (int) – A number of processes to use for mapping between numbering of sequences and structures. Generally, this should be as high as possible for faster processing. In contrast to the other operations here, this one seems more CPU-bound and less resource hungry (although, keep in mind the size of the canonical sequence: if it’s too high, the RAM usage will likely explode). If None, will default to num_proc.

  • num_proc_add_structure (int) – In case of parallel numberings mapping, i.e., when num_proc_map_numbering > 1, this option allows transferring these numberings and adding structures to chains in parallel. It may be useful when add_to_children=True is passed in kwargs, as it allows creating sub-structures in parallel.

  • kwargs – Passed to Chain.add_structure().

Returns:

A list of initialized chains.

Return type:

ChainList[Chain]
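
The item callback rules (left-to-right composition, then filtering on the final result) can be sketched with functools.reduce. The code below is an illustration on toy data, not the initializer's implementation: apply_item_callbacks(), drop_short(), and require_two() are hypothetical names, and strings stand in for Chain and ChainStructure objects.

```python
from functools import reduce


def apply_item_callbacks(item, callbacks):
    """Apply callbacks in order; return None if the item is filtered out."""
    # Each callback receives the output of the previous one.
    chain, structures = reduce(lambda acc, cb: cb(acc), callbacks, item)
    if chain is None or not structures:
        return None
    return chain, structures


def drop_short(item):
    """Toy callback: discard 'structures' shorter than three characters."""
    chain, structures = item
    return chain, [s for s in structures if len(s) >= 3]


def require_two(item):
    """Toy callback: invalidate the chain if fewer than two structures remain."""
    chain, structures = item
    return (chain if len(structures) >= 2 else None), structures


kept = apply_item_callbacks(("C", ["abc", "de", "fghi"]), [drop_short, require_two])
dropped = apply_item_callbacks(("C", ["abc", "de"]), [drop_short, require_two])
passthrough = apply_item_callbacks(("C", ["x"]), [])
```

Order matters in this sketch: running require_two before drop_short would keep the second item, because the count would be checked before short entries are removed.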

property supported_seq_ext: list[str]
Returns:

Supported sequence file extensions.

property supported_str_ext: list[str]
Returns:

Supported structure file extensions.

class lXtractor.chain.initializer.ItemCallback(*args, **kwargs)[source]

Bases: Protocol

A callback applied to processed items in ChainInitializer.from_mapping().

__call__(inp)[source]

Call self as a function.

Return type:

tuple[Chain | None, list[ChainStructure]]

__init__(*args, **kwargs)

class lXtractor.chain.initializer.SingletonCallback(*args, **kwargs)[source]

Bases: Protocol

A protocol defining the signature of a callback that ChainInitializer applies to single objects right after parsing.

__call__(inp: CT) → CT | None[source]
__call__(inp: list[ChainStructure]) → list[ChainStructure] | None
__call__(inp: None) → None

Call self as a function.

__init__(*args, **kwargs)
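
A callable satisfying the three overloads above has to handle a single object, a list of structures, and None. The following is a minimal sketch under stated assumptions: drop_empty is a hypothetical callback, a str stands in for a single Chain*-type object, and a list[str] stands in for list[ChainStructure]. How the initializer treats a None result is not specified in this section, so the sketch only shows the signature being honored.

```python
from typing import Union

# Stand-ins: a str plays the role of a single Chain*-type object and
# a list[str] plays the role of list[ChainStructure].
Single = Union[str, list[str], None]


def drop_empty(inp: Single) -> Single:
    """Hypothetical callback covering the three input kinds of the protocol."""
    if inp is None:
        return None  # nothing was parsed, so there is nothing to return
    if isinstance(inp, list):
        kept = [s for s in inp if s]  # discard empty stand-in structures
        return kept or None  # return None when no structures are left
    return inp if inp else None  # single object: keep it unless it is empty
```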

lXtractor.chain.tree module

A module to handle the ancestral tree of the Chain*-type objects defined by their parent/children attributes and/or meta info.

lXtractor.chain.tree.list_ancestors(c)[source]
>>> o = ChainSequence.from_string('x' * 5, 1, 5, 'C')
>>> c13 = o.spawn_child(1, 3)
>>> c12 = c13.spawn_child(1, 2)
>>> list_ancestors(c12)
[C|1-3<-(C|1-5), C|1-5]
Parameters:

c (Chain | ChainSequence | ChainStructure) – Chain*-type object.

Returns:

A list of ancestor objects obtained from the parent attribute.

Return type:

list[Chain | ChainSequence | ChainStructure]
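
The idea of collecting ancestors by following the parent attribute can be sketched as below. This is not the library's implementation: Node and walk_ancestors are hypothetical stand-ins, assuming only that each object exposes a parent attribute that is None at the root.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a Chain*-type object with a parent link."""
    name: str
    parent: Optional["Node"] = None


def walk_ancestors(obj: Node) -> list[Node]:
    """Collect parents from the closest one up to the root."""
    ancestors = []
    current = obj.parent
    while current is not None:
        ancestors.append(current)
        current = current.parent
    return ancestors


root = Node("C|1-5")
mid = Node("C|1-3", parent=root)
leaf = Node("C|1-2", parent=mid)
names = [n.name for n in walk_ancestors(leaf)]  # closest parent first
```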

lXtractor.chain.tree.list_ancestors_names(id_or_chain)[source]
>>> list_ancestors_names('C|1-5<-(C|1-3<-(C|1-2))')
['C|1-3', 'C|1-2']
Parameters:

id_or_chain (Chain | ChainSequence | ChainStructure | str) – Chain*-type object or its id.

Returns:

A list of the parents’ ‘{name}|{start}-{end}’ representations parsed from the object’s id.

Return type:

list[str]
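
One way to parse such an id is to split it on the ‘<-(’ separator. The sketch below is an assumption about the id layout inferred from the example above, not the library's parser: parent_names is a hypothetical helper, and it assumes that names never contain ‘)’ or ‘<-(’.

```python
def parent_names(obj_id: str) -> list[str]:
    """Split a nested id on the '<-(' separator and drop the first entry."""
    # Closing parentheses only terminate nested groups, so they can be
    # removed before splitting (assuming names never contain ')').
    parts = obj_id.replace(")", "").split("<-(")
    return parts[1:]


parents = parent_names("C|1-5<-(C|1-3<-(C|1-2))")
no_parents = parent_names("C|1-5")
```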

lXtractor.chain.tree.make(chains, connect=False, objects=False, check_is_tree=True)[source]

Make an ancestral tree – a directed graph representing ancestral relationships between chains.

Parameters:
  • chains (Iterable[Chain | ChainSequence | ChainStructure]) – An iterable of Chain*-type objects.

  • connect (bool) – Connect actual objects by populating .children and .parent attributes.

  • objects (bool) – Create an object tree using make_obj_tree(). Otherwise, create a “string” tree using make_str_tree(). Check the docs of these functions to understand the differences.

  • check_is_tree (bool) – If True, check if the obtained graph is actually a tree. If it’s not, raise ValueError.

Returns:

A networkx directed graph: an object tree if objects=True, otherwise a “string” tree.

Return type:

DiGraph

lXtractor.chain.tree.make_filled(name, _t)[source]

Make a “filled” version of an object to occupy the tree.

Parameters:
  • name (str) – Name of the node obtained via node_name().

  • _t (CT | Type[CT]) – Some Chain*-type object.

Returns:

An object with filled sequence. If it’s a ChainStructure object, it will have an empty structure.

Return type:

CT

lXtractor.chain.tree.make_obj_tree(chains, connect=False, check_is_tree=True)[source]

Make an ancestral tree – a directed graph representing ancestral relationships between chains. The nodes of the tree are Chain*-type objects. Hence, they must be hashable. This restricts types of sequences valid for ChainSequence to abc.Sequence[abc.Hashable].

As a useful side effect, this function can aid in filling the gaps in the actual tree indicated by the id-relationship suggested by the “id” field of the meta property. In other words, if a segment S|1-2 was obtained by spawning from S|1-5, S|1-2’s id will reflect this:

>>> s = make_filled('S|1-5', ChainSequence.make_empty())
>>> c12 = s.spawn_child(1, 2)
>>> c12
S|1-2<-(S|1-5)

However, if S|1-5 was lost (e.g., by writing/reading S|1-2 to/from disk), and S|1-2.parent is None, we can use the ID stored in meta to recover ancestral relationships. This function will attend to such cases and create a filler object S|1-5 with a “*”-filled sequence.

>>> c12.parent = None
>>> c12
S|1-2
>>> c12.meta['id']
'S|1-2<-(S|1-5)'
>>> ct = make_obj_tree([c12], connect=True)
>>> assert len(ct.nodes) == 2
>>> [n.id for n in ct.nodes]
['S|1-2<-(S|1-5)', 'S|1-5']
Parameters:
  • chains (Iterable[CT]) – A homogeneous iterable of Chain*-type objects.

  • connect (bool) – If True, connect both supplied and created filler objects via children and parent attributes.

  • check_is_tree (bool) – If True, check if the obtained graph is actually a tree. If it’s not, raise ValueError.

Returns:

A networkx directed graph with Chain*-type objects as nodes.

Return type:

DiGraph

lXtractor.chain.tree.make_str_tree(chains, connect=False, check_is_tree=True)[source]

A computationally cheaper alternative to make_obj_tree(), where nodes are string objects, while actual objects reside in a node attribute “objs”. It allows for a faster tree construction since it avoids expensive hashing of Chain*-type objects.

Parameters:
  • chains (Iterable[Chain | ChainSequence | ChainStructure]) – An iterable of Chain*-type objects.

  • connect (bool) – If True, connect both supplied and created filler objects via children and parent attributes.

  • check_is_tree (bool) – If True, check if the obtained graph is actually a tree. If it’s not, raise ValueError.

Returns:

A networkx directed graph.

Return type:

DiGraph
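
The “string” tree idea (string nodes, with the actual objects kept in a per-node “objs” collection and missing parents added as empty nodes) can be sketched without networkx. Everything here is an assumption made for illustration: build_str_tree is a hypothetical helper, ids are assumed to follow the ‘child<-(parent)’ layout shown earlier, edges are assumed to point from parent to child, and no tree check or filler object is created.

```python
def build_str_tree(objs_by_id: dict[str, object]):
    """Return (nodes, edges) where nodes maps a name to its list of objects."""
    nodes: dict[str, list] = {}
    edges: set[tuple[str, str]] = set()
    for obj_id, obj in objs_by_id.items():
        # 'S|1-2<-(S|1-5)' -> ['S|1-2', 'S|1-5']: the object itself first,
        # followed by its ancestors (assuming names contain no ')').
        names = obj_id.replace(")", "").split("<-(")
        nodes.setdefault(names[0], []).append(obj)
        for child, parent in zip(names, names[1:]):
            nodes.setdefault(parent, [])  # a missing parent becomes an empty node
            edges.add((parent, child))  # assumed direction: parent -> child
    return nodes, edges


nodes, edges = build_str_tree(
    {"S|1-2<-(S|1-5)": "obj_a", "S|3-4<-(S|1-5)": "obj_b"}
)
```

Since the nodes are plain strings, hashing them is cheap, which is the advantage the description above attributes to make_str_tree().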

lXtractor.chain.tree.recover(c)[source]

Recover ancestral relationships of a Chain*-type object. This will use make_str_tree() to recover ancestors from object IDs of an object itself and any encompassed children.

Note

It may be used as a callback in lXtractor.chain.io.ChainIO.read().

Note

make_str_tree() creates “filled” parents via make_filled().

Parameters:

c (Chain | ChainSequence | ChainStructure) – A Chain*-type object.

Returns:

The same object with populated children and parent attributes.

Return type:

Chain | ChainSequence | ChainStructure