lXtractor.chain package
lXtractor.chain.base module
- lXtractor.chain.base.is_chain_type_iterable(s)[source]
- Return type:
t.TypeGuard[abc.Iterable[Chain] | abc.Iterable[ChainSequence] | abc.Iterable[ChainStructure]]
- lXtractor.chain.base.topo_iter(start_obj, iterator)[source]
Iterate over sequences in topological order.
>>> n = 1
>>> it = topo_iter(n, lambda x: (x + 1 for n in range(x)))
>>> next(it)
[2]
>>> next(it)
[3, 3]
- Parameters:
start_obj (T) – Starting object.
iterator (Callable[[T], Iterable[T]]) – A callable accepting a single argument of the same type as the start_obj and returning an iterator over objects with the same type, representing the next level.
- Returns:
A generator yielding lists of objects obtained using iterator and representing topological levels with the root in start_obj.
- Return type:
Generator[list[T], None, None]
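The level-by-level traversal described for topo_iter can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the name topo_iter_sketch and the toy tree are hypothetical.

```python
from collections.abc import Callable, Iterable, Iterator
from typing import TypeVar

T = TypeVar("T")


def topo_iter_sketch(
    start_obj: T, iterator: Callable[[T], Iterable[T]]
) -> Iterator[list[T]]:
    """Yield successive levels of the tree rooted in ``start_obj``."""
    # The first level holds the direct descendants of the root.
    level = list(iterator(start_obj))
    while level:
        yield level
        # Expand every object of the current level to get the next one.
        level = [child for obj in level for child in iterator(obj)]


# Hypothetical toy tree: each key maps to the list of its children.
tree = {"root": ["a", "b"], "a": ["c"], "b": [], "c": []}
levels = list(topo_iter_sketch("root", lambda node: tree[node]))
```

The generator stops once a level is empty, so a callable that always produces descendants (as in the doctest above) yields an infinite stream of levels and should be consumed with next().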
lXtractor.chain.sequence module
- class lXtractor.chain.sequence.ChainSequence(start, end, name='S', seqs=None, parent=None, children=None, meta=None, variables=None)[source]
Bases:
Segment
A class representing polymeric sequence of a single entity (chain).
The sequences are stored internally as a dictionary {seq_name => _seq} and must all have the same length. Additionally, seq_name must be a valid field name: something one could use in namedtuples. If unsure, please use lXtractor.util.misc.is_valid_field_name() for testing.
A single gap-less primary sequence (seq1()) is mandatory during the initialization. We refer to the sequences other than seq1() as “maps.” To view the standard sequence names supported by ChainSequence, use the field_names() property.
The sequence can be a part of a larger one. The child-parent relationships are indicated via parent and children, where the latter entails any sub-sequence. A preferable way to create subsequences is the spawn_child() method.
>>> seqs = {
...     'seq1': 'A' * 10,
...     'A': ['A', 'N', 'Y', 'T', 'H', 'I', 'N', 'G', '!', '?']
... }
>>> cs = ChainSequence(1, 10, 'CS', seqs=seqs)
>>> cs
CS|1-10
>>> assert len(cs) == 10
>>> assert 'A' in cs and 'seq1' in cs
>>> assert cs.seq1 == 'A' * 10
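The constraints on the sequence dictionary (mandatory seq1, equal lengths, namedtuple-compatible names) can be checked up front. The sketch below is hypothetical and standalone: is_valid_field_name_sketch follows the field name rules of collections.namedtuple and may differ from what lXtractor.util.misc.is_valid_field_name() actually checks.

```python
from collections.abc import Mapping, Sequence
from keyword import iskeyword


def is_valid_field_name_sketch(name: str) -> bool:
    # namedtuple rejects non-identifiers, keywords, and names starting with "_".
    return name.isidentifier() and not iskeyword(name) and not name.startswith("_")


def check_seqs_sketch(seqs: Mapping[str, Sequence]) -> None:
    """Raise ValueError if ``seqs`` violates the constraints described above."""
    if "seq1" not in seqs:
        raise ValueError("the primary sequence 'seq1' is mandatory")
    bad_names = [name for name in seqs if not is_valid_field_name_sketch(name)]
    if bad_names:
        raise ValueError(f"invalid field names: {bad_names}")
    # All sequences and maps must have the same length.
    if len({len(s) for s in seqs.values()}) > 1:
        raise ValueError("all sequences must have the same length")


# Passes silently for the dictionary used in the doctest above.
check_seqs_sketch({"seq1": "A" * 10, "A": list("ANYTHING!?")})
```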
- apply_children(fn, inplace=False)[source]
Apply some function to children.
- Parameters:
fn (ApplyT[ChainSequence]) – A callable accepting and returning the chain sequence type instance.
inplace (bool) – Apply to children in place. Otherwise, return a copy with only children transformed.
- Returns:
A chain sequence with transformed children.
- Return type:
t.Self
- apply_to_map(map_name, fn, inplace=False, preserve_children=False, apply_to_children=False)[source]
Apply some function to map/sequence in this chain sequence.
- Parameters:
map_name (str) – Name of the internal sequence/map.
fn (ApplyT[abc.Sequence]) – A function accepting and returning a sequence of the same length.
inplace (bool) – Apply the operation to this object. Otherwise, create a copy with the transformed sequence.
preserve_children (bool) – Preserve children of this instance in the transformed object. Passing True makes sense if the target sequence is mutable: the children will be transformed naturally. If the target sequence is immutable, consider passing True with apply_to_children=True.
apply_to_children (bool) – Recursively apply the same fn to a child tree starting from this instance. If passed, sets preserve_children=True: otherwise, one risks removing all children in the child tree of the returned instance.
- Returns:
- Return type:
t.Self
- as_chain(transfer_children=True, structures=None, **kwargs)[source]
Convert this chain sequence to chain.
Note
Pass add_to_children=True to transfer structure to each child if transfer_children=True.
- Parameters:
transfer_children (bool) – Transfer existing children.
structures (abc.Sequence[ChainStructure] | None) – Add structures to the created chain.
kwargs – Passed to
Chain.add_structure
- Returns:
- Return type:
- as_df()[source]
- Returns:
The pandas DataFrame representation of the sequence where each column corresponds to a sequence or map.
- Return type:
DataFrame
- as_np()[source]
- Returns:
The numpy representation of a sequence as a matrix. This is a shortcut to calling as_df() and getting df.values.
- Return type:
ndarray
- coverage(map_names=None, save=True, prefix='cov')[source]
Calculate maps’ coverage, i.e., the number of non-empty elements.
- Parameters:
map_names (Sequence[str] | None) – optionally, provide the sequence of map names to calculate the coverage for.
save (bool) – save the results to
meta
prefix (str) – if save is True, format keys f"{prefix}_{name}" for the meta dictionary.
- Returns:
- Return type:
dict[str, float]
- fill(other, template, target, link_name, link_points_to, keep=True, target_new_name=None, empty_template=(None, ), empty_target=(None, ), transform=<function identity>)[source]
Fill-in a sequence in other using a template sequence from here.
As an example, consider two related sequences, s and o, mapped to the same reference numbering scheme r, which we’ll denote as a “link sequence.”
We would like to fill in “X” residues within o with residues from s. Let’s first try this:
>>> s = ChainSequence.from_string('ABCD', r=[10, 11, 12, 13])
>>> o = ChainSequence.from_string('AABXDE', r=[9, 10, 11, 12, 13, 14])
>>> s.fill(o, 'seq1', 'seq1', 'r', 'r')
['A', 'A', 'B', 'X', 'D', 'E']
In the example above, “X” was not replaced because it’s not considered an “empty” target element requiring replacement. Below, we’ll provide a tuple of possible empty values and pass a transform function that will join the result back into str.
>>> s.fill(o, 'seq1', 'seq1', 'r', 'r', empty_target=('X', ), transform="".join)
'AABCDE'
>>> o['seq1_patched'] == 'AABCDE'
True
- Parameters:
other (t.Self) – Some other chain sequence.
template (str) – The name of the template sequence.
target (str) – Target sequence name within other to patch.
link_name (str) – Name of the map within other that links it with this sequence.
link_points_to (str | None) – Name of the map within this chain sequence that corresponds to link_name within other. If None, it is assumed to be the same as link_name.
keep (bool) – Keep patched sequence within other.
target_new_name (str | None) – Name of the patched sequence to save within other if keep is True. If this or target names are “seq1”, will use “seq1_patched” as target_new_name as this sequence is considered immutable by convention.
empty_target (tuple[t.Any, ...] | abc.Callable[[T], bool]) – A tuple of element instances or a callable. If a tuple, a target element will be replaced with the corresponding element from template if it’s within this tuple. If a callable, it should accept an element of the target sequence and output True if it should be replaced with an element from the template and False otherwise.
empty_template (tuple[t.Any, ...] | abc.Callable[[T], bool]) – Same as empty_target but applied to a template character, with reverse meaning for True and False of the empty_target param.
transform (abc.Callable[[list[T]], abc.Sequence[R]]) – A function that transforms the result from one sequence to another.
- Returns:
A patched mapping/sequence after applying the transform function.
- Return type:
abc.Sequence[R]
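The fill logic can be illustrated on plain Python sequences. The following is a simplified sketch, not the method's actual implementation: fill_sketch is a hypothetical helper that supports only the tuple form of empty_target and empty_template and takes the link maps as explicit arguments.

```python
from collections.abc import Hashable, Sequence
from typing import Any


def fill_sketch(
    template: Sequence[Any],
    template_link: Sequence[Hashable],
    target: Sequence[Any],
    target_link: Sequence[Hashable],
    empty_target: tuple = (None,),
    empty_template: tuple = (None,),
) -> list[Any]:
    """Replace empty target elements with template elements sharing a link value."""
    # Index template elements by the reference numbering they are mapped to.
    by_link = dict(zip(template_link, template))
    result = []
    for elem, link in zip(target, target_link):
        candidate = by_link.get(link)
        replaceable = elem in empty_target and link in by_link
        if replaceable and candidate not in empty_template:
            result.append(candidate)
        else:
            result.append(elem)
    return result


s_seq, s_r = "ABCD", [10, 11, 12, 13]
o_seq, o_r = "AABXDE", [9, 10, 11, 12, 13, 14]
unchanged = fill_sketch(s_seq, s_r, o_seq, o_r)
patched = "".join(fill_sketch(s_seq, s_r, o_seq, o_r, empty_target=("X",)))
```

As in the doctest above, “X” is replaced only once it is declared an empty target value.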
- filter_children(pred, inplace=False)[source]
Filter children using some predicate.
- Parameters:
pred (FilterT[ChainSequence]) – Some callable accepting chain sequence and returning bool.
inplace (bool) – Filter
children
in place. Otherwise, return a copy with only children transformed.
- Returns:
A chain sequence with filtered children.
- Return type:
t.Self
- classmethod from_df(df, name='S', meta=None)[source]
Init sequence from a data frame.
- Parameters:
df (Path | pd.DataFrame) – Path to a tsv file or a pandas DataFrame.
name (str) – Name of a new chain sequence.
meta (dict[str, t.Any] | None) – Meta info of a new chain sequence.
- Returns:
Initialized chain sequence.
- Return type:
t.Self
- classmethod from_file(inp, reader=<function read_fasta>, start=None, end=None, name=None, meta=None, **kwargs)[source]
Initialize chain sequence from file.
- Parameters:
inp (Path | TextIOBase | Iterable[str]) – Path to a file or file handle or iterable over file lines.
reader (SeqReader) – A function to parse the sequence from inp.
start (int | None) – Start coordinate of a sequence in a file. If not provided, assumed to be 1.
end (int | None) – End coordinate of a sequence in a file. If not provided, will evaluate to the sequence’s length.
name (str | None) – Name of a sequence in inp. If not provided, will evaluate to a sequence’s header.
meta (dict[str, Any] | None) – Meta-info to add for the sequence.
kwargs – Additional sequences other than seq1 (as used during initialization via _seq attribute).
- Returns:
Initialized chain sequence.
- Return type:
- classmethod from_string(s, start=None, end=None, name='S', meta=None, **kwargs)[source]
Initialize chain sequence from string.
- Parameters:
s (str) – String to init from.
start (int | None) – Start coordinate (default=1).
end (int | None) – End coordinate (default=len(s)).
name (str) – Name of a new chain sequence.
meta (dict[str, Any] | None) – Meta info of a new sequence.
kwargs – Additional sequences other than seq1 (as used during initialization via _seq attribute).
- Returns:
Initialized chain sequence.
- Return type:
- get_closest(key, value, *, reverse=False)[source]
Find the closest item for which item.key >=/<= value. By default, the search starts from the sequence’s beginning and expands towards the end until it finds the first element for which the retrieved value >= the provided value. If reverse is True, the search direction is reversed, and the comparison operator becomes <=.
>>> s = ChainSequence(1, 4, 'CS', seqs={'seq1': 'ABCD', 'X': [5, 6, 7, 8]})
>>> s.get_closest('seq1', 'D')
Item(i=4, seq1='D', X=8)
>>> s.get_closest('X', 0)
Item(i=1, seq1='A', X=5)
>>> assert s.get_closest('X', 0, reverse=True) is None
- Parameters:
key (str) – map name.
value (Ord) – map value. Must support comparison operators.
reverse (bool) – reverse the sequence order and the comparison operator.
- Returns:
The first relevant item or None if no relevant items were found.
- Return type:
NamedTupleT | None
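The search rule can be sketched on a plain list of map values. This is a hypothetical standalone helper (get_closest_sketch), not the library's code; it returns a (position, value) pair with 1-based positions instead of an Item.

```python
from collections.abc import Sequence
from typing import Any, Optional


def get_closest_sketch(
    values: Sequence[Any], value: Any, reverse: bool = False
) -> Optional[tuple[int, Any]]:
    """Return the first (position, element) satisfying the comparison."""
    pairs = list(enumerate(values, start=1))
    if reverse:
        # Scan from the end and flip the comparison to <=.
        pairs.reverse()
    for i, v in pairs:
        if (v <= value) if reverse else (v >= value):
            return i, v
    return None


forward = get_closest_sketch([5, 6, 7, 8], 0)
backward = get_closest_sketch([5, 6, 7, 8], 0, reverse=True)
```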
- get_item(key, value)[source]
Get a specific item. Same as get_map(), but uses value to retrieve the needed item immediately.
(!) Use it when a single item is needed. For multiple queries for the same sequence, please use get_map().
>>> s = ChainSequence.from_string('ABC', name='CS')
>>> s.get_item('seq1', 'B').i
2
- Parameters:
key (str) – map name.
value (Any) – sequence value of the sequence under the key name.
- Returns:
an item corresponding to the desired sequence element.
- Return type:
- get_map(key, to=None, rm_empty=False)[source]
Obtain the mapping of the form “key->item(seq_name=*,…)”.
>>> s = ChainSequence.from_string('ABC', name='CS')
>>> s.get_map('i')
{1: Item(i=1, seq1='A'), 2: Item(i=2, seq1='B'), 3: Item(i=3, seq1='C')}
>>> s.get_map('seq1')
{'A': Item(i=1, seq1='A'), 'B': Item(i=2, seq1='B'), 'C': Item(i=3, seq1='C')}
>>> s.add_seq('S', [1, 2, np.nan])
>>> s.get_map('seq1', 'S', rm_empty=True)
{'A': 1, 'B': 2}
- Parameters:
key (str) – A _seq name to map from.
to (str | None) – A _seq name to map to.
rm_empty (bool) – Remove empty keys and values. A numeric value is empty if it is of type NaN. A string value is empty if it is an empty string (
""
).
- Returns:
dict mapping key values to items.
- Return type:
dict[Hashable, Any]
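The rm_empty rule (NaN for numbers, "" for strings) can be sketched as follows. Both helpers are hypothetical and cover only the key-to-value form of the mapping (the case where to is given); they are not taken from the library.

```python
import math
from collections.abc import Hashable, Sequence
from typing import Any


def is_empty_sketch(x: Any) -> bool:
    # A float is empty if it is NaN; a string is empty if it is "".
    if isinstance(x, float):
        return math.isnan(x)
    if isinstance(x, str):
        return x == ""
    return False


def get_map_sketch(
    keys: Sequence[Hashable], values: Sequence[Any], rm_empty: bool = False
) -> dict[Hashable, Any]:
    """Map elements of one sequence to elements of another, position by position."""
    pairs = zip(keys, values)
    if rm_empty:
        pairs = (
            (k, v)
            for k, v in pairs
            if not (is_empty_sketch(k) or is_empty_sketch(v))
        )
    return dict(pairs)


mapping = get_map_sketch("ABC", [1, 2, float("nan")], rm_empty=True)
```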
- iter_children()[source]
Iterate over a child tree in topological order.
>>> s = ChainSequence(1, 10, 'CS', seqs={'seq1': 'A' * 10})
>>> ss = s.spawn_child(1, 5, 'CS_')
>>> sss = ss.spawn_child(1, 3, 'CS__')
>>> list(s.iter_children())
[[CS_|1-5<-(CS|1-10)], [CS__|1-3<-(CS_|1-5<-(CS|1-10))]]
- Returns:
a generator over child tree levels, starting from the children and expanding such attributes over ChainSequence instances within this attribute.
- Return type:
Generator[ChainList[ChainSequence], None, None]
- map_boundaries(start, end, map_name, closest=False)[source]
Map the provided boundaries onto sequence.
A convenient interface for a common task where one wants to find sequence elements corresponding to arbitrary boundaries.
>>> s = ChainSequence.from_string('XXSEQXX', name='CS')
>>> s.add_seq('NCS', list(range(10, 17)))
>>> s.map_boundaries(1, 3, 'i')
(Item(i=1, seq1='X', NCS=10), Item(i=3, seq1='S', NCS=12))
>>> s.map_boundaries(5, 12, 'NCS', closest=True)
(Item(i=1, seq1='X', NCS=10), Item(i=3, seq1='S', NCS=12))
- Parameters:
- Returns:
a tuple with two items corresponding to mapped start and end.
- Return type:
tuple[NamedTupleT, NamedTupleT]
- map_numbering(other, align_method=<function mafft_align>, save=True, name='S', **kwargs)[source]
Map the numbering() of another sequence onto this one. For this, align primary sequences and relate their numbering.
>>> s = ChainSequence.from_string('XXSEQXX', name='CS')
>>> o = ChainSequence.from_string('SEQ', name='CSO')
>>> s.map_numbering(o)
[None, None, 1, 2, 3, None, None]
>>> assert 'map_CSO' in s
>>> a = Alignment([('CS1', 'XSEQX'), ('CS2', 'XXEQX')])
>>> s.map_numbering(a, name='map_aln')
[None, 1, 2, 3, 4, 5, None]
>>> assert 'map_aln' in s
- Parameters:
other (str | tuple[str, str] | ChainSequence | Alignment) – another chain _seq.
align_method (AlignMethod) – a method to use for alignment.
save (bool) – save the numbering as a sequence.
name (str) – a name to use if save is True.
kwargs – passed to map_pairs_numbering().
- Returns:
a list of integers with None indicating gaps.
- Return type:
list[None | int]
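Once two sequences are aligned, relating their numbering is a single pass over the alignment columns. The sketch below assumes the alignment is already computed and given as two gapped strings of equal length; numbering_from_alignment_sketch is a hypothetical helper, and the alignment step itself (mafft_align by default) is not reproduced.

```python
from typing import Optional


def numbering_from_alignment_sketch(
    aligned_self: str, aligned_other: str, gap: str = "-"
) -> list[Optional[int]]:
    """For each residue of this sequence, give the position of the aligned
    residue in the other sequence (1-based) or None if aligned to a gap."""
    numbering: list[Optional[int]] = []
    other_pos = 0
    for a, b in zip(aligned_self, aligned_other):
        if b != gap:
            other_pos += 1
        if a == gap:
            # Column absent from this sequence: nothing to number here.
            continue
        numbering.append(other_pos if b != gap else None)
    return numbering


numbering = numbering_from_alignment_sketch("XXSEQXX", "--SEQ--")
```

The result has one entry per residue of this sequence, which is why it can be stored as a map of the same length.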
- match(map_name1, map_name2, as_fraction=True, save=True, name='auto')[source]
- Parameters:
map_name1 (str) – Mapping name 1.
map_name2 (str) – Mapping name 2.
as_fraction (bool) – Divide by the total length.
save (bool) – Save the result to meta.
name (str) – Name of the saved metadata entry. If “auto”, will derive from given map names.
- Returns:
The total number or a fraction of matching characters between maps.
- Return type:
float
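A minimal sketch of the matching computation on two plain sequences, assuming that “matching” means equality at the same position and that the fraction is taken over the common length. match_sketch is a hypothetical helper; the way the method treats empty elements is not covered here.

```python
from collections.abc import Sequence
from typing import Any


def match_sketch(
    map1: Sequence[Any], map2: Sequence[Any], as_fraction: bool = True
) -> float:
    """Count positions where both maps hold equal elements."""
    if len(map1) != len(map2):
        raise ValueError("maps must have the same length")
    n_matching = sum(a == b for a, b in zip(map1, map2))
    if as_fraction:
        # Guard against division by zero for empty maps.
        return n_matching / len(map1) if len(map1) else 0.0
    return float(n_matching)


fraction = match_sketch("ABCD", "ABXD")
count = match_sketch("ABCD", "ABXD", as_fraction=False)
```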
- patch(other, numerator, link_name, link_points_to, diff=<built-in function sub>, num_filter=<function ChainSequence.<lambda>>, **kwargs)[source]
Patch the gaps in the provided sequence using this sequence as template.
The existence of a gap is judged by the numerator map that should point to a numeration scheme. If there are two consecutive numerator elements, for which diff returns value greater than one, this is considered a gap that could be filled in by a template.
To relate a potential gap to the template sequence, a link sequence must exist in the provided sequence, containing values referencing the template.
As an example, consider the template sequence “ABCDEG” and the sequence requiring patching “BDEG”. Let e be the numbering of the “BDEG”, e=[1, 4, 6, 7], and r=[2, 4, 5, 6] be a link map that points to the segment indices of the template.
>>> template = ChainSequence.from_string("ABCDEG", name='T')
>>> seq = ChainSequence.from_string("BDEG", name='P', e=[1,4,6,7], r=[2,4,5,6])
Observe that there is a numeration gap between 1 and 4. The corresponding elements of r point to the template indices 2 and 4. Thus, there is a gap that can be filled in by a portion of the template between 2 and 4. Here, it turns out to be a singleton sequence element “C” at position 3. This segment will be inserted into the patched sequence:
>>> patched = template.patch(seq, 'e', 'r', 'i')
>>> patched.id
'P|1-5'
>>> patched.seq1
'BCDEG'
Similar to patch(), the sequence elements missing in either of the sequences will be filled-in. Thus, what happens to the original numeration e?
>>> patched['e']
[1, None, 4, 6, 7]
On the other hand, the link sequence r can be successfully filled in by the template:
>>> patched['r']
[2, 3, 4, 5, 6]
Note
If this segment is empty or singleton, the other is returned unchanged.
Warning
This operation creates a new segment. The parents and metadata won’t be transferred.
See also
lXtractor.core.segment.Segment.insert() used to insert segments while patching.
- Parameters:
other (t.Self) – A sequence to patch.
numerator (str) – A map name in other containing numeration scheme the gaps will be inferred from.
link_name (str) – A map name in other with values referencing some sequence in this instance.
link_points_to (str) – A map name in this instance that the link_name refers to in other.
diff (abc.Callable[[T, T], int]) – A callable accepting two numerator elements – higher and lower ones – and returning the number of elements between them. By default, a simple subtraction is used.
num_filter (abc.Callable[[t.Any], bool]) – An optional filter function to filter out elements in the numerator before splitting it into consecutive pairs. By default, this function will filter out any None values.
kwargs – Additional keyword arguments passed to lXtractor.core.segment.Segment.insert().
- Returns:
A new patched segment.
- Return type:
t.Self
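The gap detection and filling can be sketched on plain strings and lists. This is a simplified, hypothetical helper (patch_sketch), not the method's implementation: it patches only the primary sequence, uses plain subtraction as diff, assumes the link map holds 1-based template positions, and ignores None filtering and the other maps.

```python
from collections.abc import Sequence


def patch_sketch(
    template: str, seq: str, numerator: Sequence[int], link: Sequence[int]
) -> str:
    """Fill numbering gaps in ``seq`` with the matching pieces of ``template``."""
    if not seq:
        return seq
    out = [seq[0]]
    for k in range(1, len(seq)):
        # A jump greater than one between consecutive numerator elements
        # signals a potential gap.
        if numerator[k] - numerator[k - 1] > 1:
            # Template positions strictly between the two linked positions:
            # 1-based link[k-1]+1 .. link[k]-1 equals the 0-based slice below.
            out.append(template[link[k - 1] : link[k] - 1])
        out.append(seq[k])
    return "".join(out)


patched_seq = patch_sketch("ABCDEG", "BDEG", [1, 4, 6, 7], [2, 4, 5, 6])
```

The second jump in the numerator (4 to 6) adds nothing because the linked template positions 4 and 5 are adjacent, so only “C” is inserted.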
- classmethod read(base_dir, *, search_children=False)[source]
Initialize chain sequence from a dump created using write().
- Parameters:
base_dir (Path) – A path to a dump dir.
search_children (bool) – Recursively search for child segments and populate the
children
- Returns:
Initialized chain sequence.
- Return type:
t.Self
- relate(other, map_name, link_name, link_points_to='i', keep=True, map_name_in_other=None)[source]
Relate mapping from this sequence with other via some common “link” sequence.
The “link” sequence is a part of the other pointing to some sequence within this instance.
As an example, consider the case of transferring the mapping to alignment positions aln_map. To do this, the other must be mapped to some sequence within this instance – typically to canonical numbering – via some stored map_canonical sequence.
Thus, one would use:
    this.relate(
        other, map_name=aln_map, link_name=map_canonical, link_points_to="i"
    )
In the example below, we transfer map_some sequence from s to o via sequence L pointing to the primary sequence of s:
seq1    : A B C D ---|
map_some: 9 8 7 6    | --> 9 8 None 6 (map transferred to `o`)
          | | | |    |
seq1    : X Y Z R    |
L       : A B X D ---|
>>> s = ChainSequence.from_string('ABCD', name='CS')
>>> s.add_seq('map_some', [9, 8, 7, 6])
>>> o = ChainSequence.from_string('XYZR', name='XY')
>>> o.add_seq('L', ['A', 'B', 'X', 'D'])
>>> assert 'L' in o
>>> s.relate(o, map_name='map_some', link_name='L', link_points_to='seq1')
[9, 8, None, 6]
>>> assert o['map_some'] == [9, 8, None, 6]
- Parameters:
other (t.Self) – An arbitrary chain sequence.
map_name (str) – The name of the sequence to transfer.
link_name (str) – The name of the “link” sequence that connects self and other.
link_points_to (str) – Values within this instance the “link” sequence points to.
keep (bool) – Store the obtained sequence within the other.
map_name_in_other (str | None) – The name of the mapped sequence to store within the other. By default, the map_name is used.
- Returns:
The mapped sequence.
- Return type:
list[t.Any]
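At its core, relating two sequences through a link is a dictionary lookup. The sketch below is a hypothetical standalone helper (relate_sketch) operating on plain lists; it mirrors the example above but does not store anything in the other sequence.

```python
from collections.abc import Hashable, Sequence
from typing import Any, Optional


def relate_sketch(
    map_values: Sequence[Any],
    points_to: Sequence[Hashable],
    link: Sequence[Hashable],
) -> list[Optional[Any]]:
    """Transfer ``map_values`` to another sequence via its ``link`` values."""
    # Index the map to transfer by the values the link refers to.
    lookup = dict(zip(points_to, map_values))
    # Link values missing from this sequence translate to None.
    return [lookup.get(x) for x in link]


transferred = relate_sketch([9, 8, 7, 6], "ABCD", ["A", "B", "X", "D"])
```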
- rename(name)[source]
Rename this sequence by modifying the name.
Note
This is a mutable operation. Returning a copy of this sequence upon renaming will create two identical sequences with different IDs, which is discouraged.
- Parameters:
name (str) – New name.
- Returns:
The same sequence with a new name.
- Return type:
t.Self
- spawn_child(start, end, name=None, category=None, *, map_from=None, map_closest=False, deep_copy=False, keep=True)[source]
Spawn the sub-sequence from the current instance.
Child sequence’s boundaries must be within this sequence’s boundaries.
Uses the Segment.sub() method.
>>> s = ChainSequence(
...     1, 4, 'CS',
...     seqs={'seq1': 'ABCD', 'X': [5, 6, 7, 8]}
... )
>>> child1 = s.spawn_child(1, 3, 'Child1')
>>> assert child1.id in s.children
>>> s.children
[Child1|1-3<-(CS|1-4)]
- Parameters:
start (int) – Start of the sub-sequence.
end (int) – End of the sub-sequence.
name (str | None) – Spawned child sequence’s name.
category (str | None) – Spawned child category. Any meaningful tag string that could be used later to group similar children.
map_from (str | None) – Optionally, the map name the boundaries correspond to.
map_closest (bool) – Map to closest start, end boundaries (see map_boundaries()).
deep_copy (bool) – Deep copy inherited sequences.
keep (bool) – Save child sequence within children.
- Returns:
Spawned sub-sequence.
- Return type:
- write(dest, *, write_children=False)[source]
Dump this chain sequence. Creates sequence.tsv and meta.tsv in dest using write_seq() and write_meta().
- Parameters:
dest (Path) – Destination directory.
write_children (bool) – Recursively write children.
- Returns:
Path to the directory where the files are written.
- Return type:
Path
- write_meta(path, sep='\t')[source]
Write meta information as {key}{sep}{value} lines.
- Parameters:
path (Path) – Write destination file.
sep – Separator between key and value.
- Returns:
Nothing.
- write_seq(path, fields=None, sep='\t')[source]
Write the sequence (and all its maps) as a table.
- Parameters:
path (Path) – Write destination file.
fields (list[str] | None) – Optionally, names of sequences to dump.
sep (str) – Table separator. Please use the default to avoid ambiguities and keep readability.
- Returns:
Nothing.
- property categories: list[str]
- Returns:
A list of categories associated with this object.
Categories are kept under “category” field in
meta
as a “,”-separated list of strings. For instance, “domain,family_x”.
- property fields: tuple[str, ...]
- Returns:
Names of the currently stored sequences.
- property seq: t.Self
This property exists for functionality relying on the .seq attribute.
- Returns:
This object.
- property seq1: str
- Returns:
the primary sequence.
- property seq3: Sequence[str]
- Returns:
the three-letter codes of a primary sequence.
- lXtractor.chain.sequence.map_numbering_12many(obj_to_map, seqs, num_proc=1, verbose=False, **kwargs)[source]
Map numbering of a single sequence to many other sequences.
This function does not save mapped numberings.
See also
- Parameters:
obj_to_map (str | tuple[str, str] | ChainSequence | Alignment) – Object whose numbering should be mapped to seqs.
seqs (Iterable[ChainSequence]) – Chain sequences to map the numbering to.
num_proc (int) – A number of parallel processes to use.
verbose (bool) – Output progress bar.
kwargs – Passed to
lXtractor.util.misc.apply()
.
- Returns:
An iterator over the mapped numberings.
- Return type:
Iterator[list[int | None]]
- lXtractor.chain.sequence.map_numbering_many2many(objs_to_map, seq_groups, num_proc=1, verbose=False, **kwargs)[source]
Map numbering of each object o in objs_to_map to each sequence in each group of the seq_groups:
o1 -> s1_1 s1_2 s1_3 ...
o2 -> s2_1 s2_2 s2_3 ...
...
This function does not save mapped numberings.
For a single object-group pair, it’s the same as map_numbering_12many(). The benefit comes from parallelization of this functionality.
- Parameters:
objs_to_map (Sequence[str | tuple[str, str] | ChainSequence | Alignment]) – An iterable over objects whose numbering to map.
seq_groups (Sequence[Sequence[ChainSequence]]) – Group of objects to map numbering to.
num_proc (int) – A number of processes to use.
verbose (bool) – Output a progress bar.
kwargs – Passed to
lXtractor.util.misc.apply()
.
- Returns:
An iterator over lists of lists with numeric mappings:
[[s1_1 map, s1_2 map, ...]
 [s2_1 map, s2_2 map, ...]
 ...
]
- Return type:
Iterator[list[list[int | None]]]
lXtractor.chain.structure module
- class lXtractor.chain.structure.ChainStructure(structure, chain_id=None, structure_id=None, seq=None, parent=None, children=None, variables=None)[source]
Bases:
object
A structure of a single chain.
Typical usage workflow:
- Use GenericStructure.read() (lXtractor.core.structure.GenericStructure.read) to parse the file.
- Split into chains using split_chains() (lXtractor.core.structure.GenericStructure.split_chains).
- Initialize ChainStructure from each chain via from_structure().
from pathlib import Path

from lXtractor.chain.structure import ChainStructure
from lXtractor.core.structure import GenericStructure

s = GenericStructure.read(Path("path/to/structure.cif"))
chain_structures = [
    ChainStructure.from_structure(c) for c in s.split_chains()
]
Two main containers are:
_seq – a ChainSequence of this structure, also containing meta info.
pdb – a container with pdb id, pdb chain id, and the structure itself.
A unique structure is defined by
- __init__(structure, chain_id=None, structure_id=None, seq=None, parent=None, children=None, variables=None)[source]
- Parameters:
structure_id (str | None) – An ID for the structure the chain was taken from.
chain_id (str | None) – A chain ID (e.g., “A”, “B”, etc.)
structure (GenericStructure | bst.AtomArray | None) – Parsed generic structure with a single chain.
seq (ChainSequence | None) – Chain sequence of a structure. If not provided, will use get_sequence.
parent (ChainStructure | None) – Specify parental structure.
children (abc.Iterable[ChainStructure] | None) – Specify structures descended from this one. This container is used to record sub-structures obtained via spawn_child().
variables (Variables | None) – Variables associated with this structure.
- Raises:
InitError – If an invalid structure (e.g., a multi-chain structure) is provided.
- apply_children(fn, inplace=False)[source]
Apply some function to children.
- Parameters:
fn (ApplyT[ChainStructure]) – A callable accepting and returning the chain structure type instance.
inplace (bool) – Apply to children in place. Otherwise, return a copy with only children transformed.
- Returns:
A chain structure with transformed children.
- Return type:
t.Self
- filter_children(pred, inplace=False)[source]
Filter children using some predicate.
- Parameters:
pred (FilterT[ChainStructure]) – Some callable accepting chain structure and returning bool.
inplace (bool) – Filter
children
in place. Otherwise, return a copy with only children transformed.
- Returns:
A chain structure with filtered children.
- Return type:
t.Self
- iter_children()[source]
Iterate children in topological order.
See ChainSequence.iter_children() and topo_iter().
- Return type:
Generator[list[ChainStructure], None, None]
- classmethod make_empty()[source]
Create an empty chain structure.
- Returns:
An empty chain structure.
- Return type:
- classmethod read(base_dir, *, search_children=False, **kwargs)[source]
Read the chain structure from a file disk dump.
- Parameters:
base_dir (Path) – An existing dir containing structure, structure sequence, meta info, and (optionally) any sub-structure segments.
dump_names – File names container.
search_children (bool) – Recursively search for sub-segments and populate children.
kwargs – Passed to lXtractor.core.structure.GenericStructure.read().
- Returns:
An initialized chain structure.
- Return type:
t.Self
- rm_solvent(copy=False)[source]
Remove solvent “residues” from this structure.
- Parameters:
copy (bool) – Copy an atom array that results from solvent removal.
- Returns:
A new instance without solvent molecules.
- Return type:
t.Self
- spawn_child(start, end, name=None, category=None, *, map_from=None, map_closest=True, keep_seq_child=False, keep=True, deep_copy=False, tolerate_failure=False, silent=False)[source]
Create a sub-structure from this one. Start and end have inclusive boundaries.
- Parameters:
start (int) – Start coordinate.
end (int) – End coordinate.
name (str | None) – The name of the spawned sub-structure.
category (str | None) – Spawned child category. Any meaningful tag string that could be used later to group similar children.
map_from (str | None) – Optionally, the map name the boundaries correspond to.
map_closest (bool) – Map to closest start, end boundaries (see map_boundaries()).
keep_seq_child (bool) – Keep spawned sub-sequence within ChainSequence.children. Beware that it’s best to use a single object type for keeping parent-children relationships to avoid duplicating information.
keep (bool) – Keep spawned substructure in children.
deep_copy (bool) – Deep copy spawned sub-sequence and sub-structure.
tolerate_failure (bool) – Do not raise the InitError if the resulting structure subset is empty.
silent (bool) – Do not display warnings if tolerate_failure is True.
- Returns:
New chain structure – a sub-structure of the current one.
- Return type:
- superpose(other, res_id=None, atom_names=None, map_name_self=None, map_name_other=None, mask_self=None, mask_other=None, inplace=False, rmsd_to_meta=True)[source]
Superpose some other structure to this one. It uses biotite.structure.superimpose() internally.
The most important requirement is both structures (after all optional selections applied) having the same number of atoms.
- Parameters:
other (ChainStructure) – Other chain structure (mobile).
res_id (Sequence[int] | None) – Residue positions within this or other chain structure. If None, use all available residues.
atom_names (Sequence[Sequence[str]] | Sequence[str] | None) – Atom names to use for selected residues. Two options are available:
1) Sequence of sequences of atom names. In this case, atom names are given per selected residue (res_id), and the external sequence’s length must correspond to the number of residues in the res_id. Note that if no res_id is provided, the sequence must encompass all available residues.
2) A sequence of atom names. In this case, it will be used to select atoms for each available residue. For instance, use atom_names=["CA", "C", "N"] to select backbone atoms.
map_name_self (str | None) – Use this map to map res_id to real numbering of this structure.
map_name_other (str | None) – Use this map to map res_id to real numbering of the other structure.
mask_self (ndarray | None) – Per-atom boolean selection mask to pick fixed atoms within this structure.
mask_other (ndarray | None) – Per-atom boolean selection mask to pick mobile atoms within the other structure. Note that mask_self and mask_other take precedence over other selection specifications.
inplace (bool) – Apply the transformation to the mobile structure inplace, mutating other. Otherwise, make a new instance: same as other, but with transformed atomic coordinates of a pdb.structure.
rmsd_to_meta (bool) – Write RMSD to the
meta
of other as “rmsd
- Returns:
A tuple with (1) transformed chain structure, (2) transformation RMSD, and (3) transformation matrices (see biotite.structure.superimpose() for details).
- Return type:
tuple[ChainStructure, float, tuple[ndarray, ndarray, ndarray]]
- write(dest, fmt='mmtf.gz', *, write_children=False)[source]
Write this object into a directory. It will create the following files:
meta.tsv
sequence.tsv
structure.fmt
Existing files will be overwritten.
- Parameters:
dest (Path) – A writable dir to save files to.
fmt (str) – Structure format to use. Supported formats are “pdb”, “cif”, and “mmtf”. Adding “.gz” (e.g., “mmtf.gz”) will lead to gzip compression.
write_children (bool) – Recursively write
children
.
- Returns:
Path to the directory where the files are written.
- Return type:
Path
- property altloc: str
- Returns:
An altloc ID.
- property array: AtomArray
- Returns:
The AtomArray object (a shortcut for .pdb.structure.array).
- property categories: list[str]
- Returns:
A list of categories encapsulated within
ChainSequence.meta
.
- property chain_id: str
- children: ChainList[ChainStructure]
Any sub-structures descended from this one, preferably using
spawn_child()
.
- property end: int
- Returns:
Structure sequence’s
end
- property id: str
- Returns:
ChainStructure identifier in the format “ChainStructure({_seq.id}|{alt_locs})<-(parent.id)”.
- property is_empty: bool
- Returns:
True if the structure is empty and False otherwise.
- property meta: dict[str, str]
- Returns:
Meta info of a
_seq
.
- property name: str | None
- Returns:
Structure sequence’s
name
- property parent: t.Self | None
- property seq: ChainSequence
- property start: int
- Returns:
Structure sequence’s
start
- property structure: GenericStructure
- variables: Variables
Variables assigned to this structure. Each should be of a
lXtractor.variables.base.StructureVariable
.
- lXtractor.chain.structure.filter_selection_extended(c, pos=None, atom_names=None, map_name=None, exclude_hydrogen=False, tolerate_missing=False)[source]
Get mask for certain positions and atoms of a chain structure.
- Parameters:
c (ChainStructure) – Arbitrary chain structure.
pos (Sequence[int] | None) – A sequence of positions.
atom_names (Sequence[Sequence[str]] | Sequence[str] | None) – A sequence of atom names (broadcasted to each position in res_id) or an iterable over such sequences for each position in res_id.
map_name (str | None) – A map name to map from pos to numbering.
exclude_hydrogen (bool) – For convenience, exclude hydrogen atoms. Especially useful during pre-processing for superposition.
tolerate_missing (bool) – If certain positions fail to map, do not raise an error.
- Returns:
A binary mask, True for selected atoms.
- Return type:
ndarray
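Example (illustrative sketch, not the library’s implementation): the two accepted shapes of `atom_names` can be pictured with plain lists. The function `selection_mask` and its arguments `res_ids` and `atom_names_per_atom` are hypothetical stand-ins for per-atom annotations; the real function works on a ChainStructure and returns an ndarray.

```python
def selection_mask(res_ids, atom_names_per_atom, pos, atom_names=None):
    """Return one boolean per atom: True if the atom is selected."""
    if atom_names is None:
        # No atom filter: every atom of a selected position is kept.
        allowed = {p: None for p in pos}
    elif atom_names and isinstance(atom_names[0], str):
        # A flat sequence of names is broadcast to each position.
        allowed = {p: set(atom_names) for p in pos}
    else:
        # One sequence of names per position.
        if len(atom_names) != len(pos):
            raise ValueError("Need one sequence of atom names per position")
        allowed = {p: set(names) for p, names in zip(pos, atom_names)}
    mask = []
    for rid, name in zip(res_ids, atom_names_per_atom):
        names = allowed.get(rid, set())
        mask.append(names is None or name in names)
    return mask


res_ids = [1, 1, 1, 2, 2, 3]
names = ["N", "CA", "C", "N", "CA", "CA"]
broadcast = selection_mask(res_ids, names, [1, 3], ["CA"])
per_position = selection_mask(res_ids, names, [1, 2], [["N"], ["CA"]])
print(broadcast, per_position)
```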
- lXtractor.chain.structure.subset_to_matching(reference, c, map_name=None, skip_if_match='seq1', **kwargs)[source]
Subset both chain structures to aligned residues using sequence alignment.
Note
It’s not necessary, but it makes sense for reference and c to be somehow related.
- Parameters:
reference (ChainStructure) – A chain structure to align to.
c (ChainStructure) – A chain structure to align.
map_name (str | None) – If provided, c is considered “pre-aligned” to the reference, and reference possesses the numbering under map_name.
skip_if_match (str) –
Two options:
1. Sequence/Map name, e.g., “seq1” – if sequences under this name match exactly, skip alignment and return original chain structures.
2. “len” – if sequences have equal length, skip alignment and return original chain structures.
- Returns:
A pair of new structures having the same number of residues that were successfully matched during the alignment.
- Return type:
tuple[ChainStructure, ChainStructure]
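Example (illustrative sketch): the two `skip_if_match` options amount to a cheap check performed before any alignment. The function `should_skip_alignment` is hypothetical, sequences are modelled as plain dictionaries of strings, and the use of “seq1” for the length comparison is an assumption.

```python
def should_skip_alignment(ref_seqs, c_seqs, skip_if_match="seq1"):
    """Decide whether alignment can be skipped for two sets of named sequences."""
    if skip_if_match == "len":
        # Option 2: equal length is enough (assumed to be measured on "seq1").
        return len(ref_seqs["seq1"]) == len(c_seqs["seq1"])
    # Option 1: sequences under the given name must match exactly.
    return ref_seqs[skip_if_match] == c_seqs[skip_if_match]


ref = {"seq1": "ACDE"}
other = {"seq1": "ACDF"}
print(should_skip_alignment(ref, other), should_skip_alignment(ref, other, "len"))
```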
lXtractor.chain.chain module
- class lXtractor.chain.chain.Chain(seq, structures=None, parent=None, children=None)[source]
Bases:
object
A container encompassing a ChainSequence and possibly many ChainStructure’s corresponding to a single protein chain.
A typical use case is when one wants to benefit from the connection of structural and sequential data, e.g., using a single full canonical sequence as _seq and all the associated structures within structures. In this case, this data structure makes it easier to extract, annotate, and calculate variables using the canonical sequence mapped to the sequence of a structure.
Typical workflow:
1. Initialize from some canonical sequence.
2. Add structures and map their sequences.
3. ???
4. Do something useful, like calculate variables using canonical sequence’s positions.
c = Chain.from_sequence((header, _seq))
for s in structures:
    c.add_structure(s)
- __init__(seq, structures=None, parent=None, children=None)[source]
- Parameters:
seq (ChainSequence) – A chain sequence.
structures (Iterable[ChainStructure] | None) – Chain structures corresponding to a single protein chain specified by _seq.
parent (Chain | None) – A parent chain this chain had descended from.
children (Iterable[Chain] | None) – A collection of children.
- add_structure(structure, *, check_ids=True, map_to_seq=True, map_name='map_canonical', add_to_children=False, **kwargs)[source]
Add a structure to structures.
- Parameters:
structure (ChainStructure) – A structure of a single chain corresponding to _seq.
check_ids (bool) – Check that existing structures don’t encompass the structure with the same id().
map_to_seq (bool) – Align the structure sequence to the _seq and create a mapping within the former.
map_name (str) – If map_to_seq is True, use this map name.
add_to_children (bool) – If True, will recursively add structure to existing children according to their boundaries mapped to the structure’s numbering. Consequently, this requires mapping, i.e., map_to_seq=True.
kwargs – Passed to ChainSequence.map_numbering().
- Returns:
Mutates structures and returns nothing.
- Raises:
ValueError – If check_ids is True and the structure id clashes with the existing ones.
- apply_structures(fn, inplace=False)[source]
Apply some function to structures.
- Parameters:
fn (ApplyT[ChainStructure]) – A callable accepting and returning a chain structure.
inplace (bool) – Apply to structures in place. Otherwise, return a copy with only children transformed.
- Returns:
A chain with transformed structures.
- Return type:
t.Self
- filter_structures(pred, inplace=False)[source]
Filter chain structures.
- Parameters:
pred (FilterT[ChainStructure]) – A callable accepting a chain structure and returning bool.
inplace (bool) – Filter structures in place. Otherwise, return a copy with only children transformed.
- Returns:
A chain with filtered structures.
- Return type:
t.Self
- generate_patched_seqs(numbering='numbering', link_name='map_canonical', link_points_to='i', **kwargs)[source]
Generate patched sequences from chain structure sequences.
For an explanation of the patching process, see lXtractor.chain.sequence.ChainSequence.patch().
- Parameters:
numbering (str) – Map name referring to a numbering scheme to infer gaps from.
link_name (str) – Map name linking structure sequence to the canonical sequence.
link_points_to (str) – Map name in the canonical sequence that link_name refers to.
kwargs – Passed to lXtractor.chain.sequence.ChainSequence.patch().
- Returns:
A generator over patched structure sequences.
- Return type:
Generator[ChainSequence, None, None]
- iter_children()[source]
Iterate children in topological order.
See ChainSequence.iter_children() and topo_iter().
- Returns:
Iterator over levels of a child tree.
- Return type:
Generator[list[Chain], None, None]
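Example (illustrative sketch): iterating “levels of a child tree” means yielding all direct children first, then all grandchildren, and so on. The `Node` class and `iter_levels` function below are hypothetical stand-ins that only mimic the `children` attribute; they do not use lXtractor.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def iter_levels(root: Node) -> Iterator[List[Node]]:
    """Yield lists of descendants, one list per topological level."""
    level = list(root.children)
    while level:
        yield level
        # The next level gathers the children of every node in this one.
        level = [child for node in level for child in node.children]


root = Node("A", [Node("B", [Node("D")]), Node("C")])
levels = [[n.name for n in level] for level in iter_levels(root)]
print(levels)
```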
- spawn_child(start, end, name=None, category=None, *, subset_structures=True, tolerate_failure=False, silent=False, keep=True, seq_deep_copy=False, seq_map_from=None, seq_map_closest=True, seq_keep_child=False, str_deep_copy=False, str_map_from=None, str_map_closest=True, str_keep_child=True, str_seq_keep_child=False, str_min_size=1, str_accept_fn=<function Chain.<lambda>>)[source]
Subset a _seq and (optionally) each structure in structures using the provided _seq boundaries (inclusive).
- Parameters:
start (int) – Start coordinate.
end (int) – End coordinate.
name (str | None) – Name of a new chain.
category (str | None) – Spawned child category. Any meaningful tag string that could be used later to group similar children.
subset_structures (bool) – If True, subset each structure in structures. If False, structures are not inherited.
tolerate_failure (bool) – If True, a failure to subset a structure doesn’t raise an error.
silent (bool) – Suppress warnings for errors when tolerate_failure is True.
keep (bool) – Save created child to children.
seq_deep_copy (bool) – Deep copy potentially mutable sequences within _seq.
seq_map_from (str | None) – Use this map to obtain coordinates within _seq.
seq_map_closest (bool) – Map to the closest matching coordinates of a _seq. See ChainSequence.map_boundaries() and ChainSequence.find_closest().
seq_keep_child (bool) – Keep a spawned ChainSequence as a child within _seq. Should be False if keep is True to avoid data duplication.
str_deep_copy (bool) – Deep copy each sub-structure.
str_map_from (str | None) – Use this map to obtain coordinates within ChainStructure._seq of each structure.
str_map_closest (bool) – Map to the closest matching coordinates of a _seq. See ChainSequence.map_boundaries() and ChainSequence.find_closest().
str_keep_child (bool) – Keep a spawned sub-structure as a child in ChainStructure.children. Should be False if keep is True to avoid data duplication.
str_seq_keep_child (bool) – Keep a sub-sequence of a spawned structure within the ChainSequence.children of ChainStructure._seq of a spawned structure. Should be False if keep or str_keep_child is True to avoid data duplication.
str_min_size (int | float) – A minimum number of residues in a structure to be accepted after subsetting.
str_accept_fn (abc.Callable[[ChainStructure], bool]) – A filter function accepting a ChainStructure and returning a boolean value indicating whether this structure should be retained in structures.
- Returns:
A sub-chain with sub-sequence and (optionally) sub-structures.
- Return type:
t.Self
- transfer_seq_mapping(map_name, link_map='map_canonical', link_map_points_to='i', **kwargs)[source]
Transfer sequence mapping to each ChainStructure._seq within structures.
This method simply utilizes ChainSequence.relate() to transfer some map from the _seq to each ChainStructure._seq. Check ChainSequence.relate() for an explanation.
- Parameters:
map_name (str) – The name of the map to transfer.
link_map (str) – A name of the map existing within ChainStructure._seq of each structure in structures.
link_map_points_to (str) – Which sequence values of the link_map point to.
kwargs – Passed to ChainSequence.relate().
- Returns:
Nothing.
- write(dest, *, str_fmt='mmtf.gz', write_children=True)[source]
Create a disk dump of this chain data. Created dumps can be reinitialized via read().
- Parameters:
dest (Path) – A writable dir to hold the data.
str_fmt (str) – A format to write structures in.
write_children (bool) – Recursively write children.
- Returns:
Path to the directory where the files are written.
- Return type:
Path
- property categories: list[str]
- Returns:
A list of categories from _seq’s ChainSequence.meta.
- children: ChainList[Chain]
A collection of children preferably obtained using spawn_child().
- property end: int
- Returns:
Structure sequence’s end.
- property id: str
- Returns:
Chain identifier derived from its _seq ID.
- property name: str | None
- Returns:
Structure sequence’s name.
- property parent: t.Self | None
- property seq: ChainSequence
- property start: int
- Returns:
Structure sequence’s start.
- structures: ChainList[ChainStructure]
lXtractor.chain.list module
The module defines the ChainList, a list of Chain*-type objects that behaves like a regular list but has additional bells and whistles tailored towards Chain* data structures.
- class lXtractor.chain.list.ChainList(chains, categories=None)[source]
Bases:
MutableSequence[CT]
A mutable single-type collection holding either Chain’s, or ChainSequence’s, or ChainStructure’s.
The object’s functionality relies on this type purity. Adding or concatenating objects of a different type shall raise an error.
It behaves like a regular list with additional functionality.
>>> from lXtractor.chain import ChainSequence
>>> s = ChainSequence.from_string('SEQUENCE', name='S')
>>> x = ChainSequence.from_string('XXX', name='X')
>>> x.meta['category'] = 'x'
>>> cl = ChainList([s, s, x])
>>> cl
[S|1-8, S|1-8, X|1-3]
>>> cl[0]
S|1-8
>>> cl['S']
[S|1-8, S|1-8]
>>> cl[:2]
[S|1-8, S|1-8]
>>> cl['1-3']
[X|1-3]
Adding, appending, or removing objects of the same type is easy and works like a regular list.
>>> cl += [s]
>>> assert len(cl) == 4
>>> cl.remove(s)
>>> assert len(cl) == 3
Categories can be accessed as attributes or using [] syntax (similar to pandas.DataFrame columns).
>>> cl.x
[X|1-3]
>>> cl['x']
[X|1-3]
While creating a chain list, using the categories parameter will assign categories to sequences. Note that such operations return a new ChainList object.
>>> cl = ChainList([s, x], categories=['S', ['X1', 'X2']])
>>> cl.S
[S|1-8]
>>> cl.X2
[X|1-3]
>>> cl['X1']
[X|1-3]
- __init__(chains, categories=None)[source]
- Parameters:
chains (Iterable[CT]) – An iterable over Chain*-type objects.
categories (Iterable[str | Iterable[str]] | None) – An optional list of categories. If provided, they will be assigned to inputs’ meta attributes.
- apply(fn, verbose=False, desc='Applying to objects', num_proc=1)[source]
Apply a function to each object and return a new chain list of results.
- collapse()[source]
Collapse all objects and their children within this list into a new chain list. This is a shortcut for chain_list + chain_list.collapse_children().
- Returns:
Collapsed list.
- Return type:
ChainList[CT]
- collapse_children()[source]
Collapse all children of each object in this list into a single chain list.
>>> from lXtractor.chain import ChainSequence
>>> s = ChainSequence.from_string('ABCDE', name='A')
>>> child1 = s.spawn_child(1, 4)
>>> child2 = child1.spawn_child(2, 3)
>>> cl = ChainList([s]).collapse_children()
>>> assert isinstance(cl, ChainList)
>>> cl
[A|1-4<-(A|1-5), A|2-3<-(A|1-4<-(A|1-5))]
- Returns:
A chain list of all children.
- Return type:
ChainList[CT]
- drop_duplicates(key=<function ChainList.<lambda>>)[source]
- Parameters:
key (abc.Callable[[CT], t.Hashable] | None) – A callable accepting the single element and returning some hashable object associated with that element.
- Returns:
A new list with unique elements as judged by the key.
- Return type:
t.Self
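Example (illustrative sketch): key-based deduplication keeps one element per distinct key. The standalone function below is hypothetical and works on plain strings; keeping the first occurrence and preserving the input order are assumptions, not documented behaviour of ChainList.drop_duplicates().

```python
def drop_duplicates(items, key=lambda x: x):
    """Return a new list holding the first element seen for each key."""
    seen = set()
    unique = []
    for item in items:
        k = key(item)  # must be hashable, as required for the `key` callable
        if k not in seen:
            seen.add(k)
            unique.append(item)
    return unique


ids = ["A|1-4", "a|1-4", "B|1-3"]
deduplicated = drop_duplicates(ids, key=str.lower)
print(deduplicated)
```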
- filter(pred)[source]
>>> from lXtractor.chain import ChainSequence
>>> cl = ChainList(
...     [ChainSequence.from_string('AAAX', name='A'),
...      ChainSequence.from_string('XXX', name='X')]
... )
>>> cl.filter(lambda c: c.seq1[0] == 'A')
[A|1-4]
- Parameters:
pred (Callable[[CT], bool]) – Predicate callable for filtering.
- Returns:
A filtered chain list (new object).
- Return type:
ChainList[CT]
- filter_category(name)[source]
- Parameters:
name (str) – Category name.
- Returns:
Filtered objects having this category within their meta["category"].
- Return type:
- filter_pos(s, *, match_type='overlap', map_name=None)[source]
Filter to objects encompassing certain consecutive position regions or arbitrary positions’ collections.
For Chain and ChainStructure, the filtering is over _seq attributes.
- Parameters:
s (lxs.Segment | abc.Collection[Ord]) –
What to search for:
s=Segment(start, end) to find all objects encompassing a certain region.
[pos1, posX, posN] to find all objects encompassing the specified positions.
match_type (str) –
If s is Segment, this value determines the acceptable relationships between s and each ChainSequence:
“overlap” – it’s enough to overlap with s.
“bounding” – object is accepted if it bounds s.
“bounded” – object is accepted if it’s bounded by s.
map_name (str | None) –
Use this map to map positions of s. For instance, to search for all elements encompassing region 1-5 of a canonical sequence, one would use
chain_list.filter_pos(
    s=Segment(1, 5), match_type="bounding", map_name="map_canonical"
)
- Returns:
A list of hits of the same type.
- Return type:
ChainList[CS]
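Example (illustrative sketch): the three `match_type` values correspond to simple comparisons of inclusive boundaries. The `Seg` class and `matches` function are hypothetical stand-ins for lXtractor segments and ignore the `map_name` mapping step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seg:
    start: int
    end: int


def matches(obj: Seg, s: Seg, match_type: str = "overlap") -> bool:
    """Check how `obj` relates to the query segment `s` (inclusive bounds)."""
    if match_type == "overlap":
        return obj.start <= s.end and s.start <= obj.end
    if match_type == "bounding":
        # The object contains the whole query.
        return obj.start <= s.start and s.end <= obj.end
    if match_type == "bounded":
        # The object lies entirely within the query.
        return s.start <= obj.start and obj.end <= s.end
    raise ValueError(f"Unknown match_type: {match_type}")


query = Seg(1, 5)
results = {
    "overlap": matches(Seg(3, 10), query, "overlap"),
    "bounding": matches(Seg(3, 10), query, "bounding"),
    "bounded": matches(Seg(2, 4), query, "bounded"),
}
print(results)
```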
- get_level(n)[source]
Get a specific level of a hierarchical tree starting from this list:
l0: this list
l1: children of each child of each object in l0
l2: children of each child of each object in l1
...
- Parameters:
n (int) – The level index (0 indicates this list). Other levels are obtained via iter_children().
- Returns:
A chain list of objects corresponding to a specific topological level of a child tree.
- Return type:
ChainList[CT]
- groupby(key)[source]
Group sequences in this list by a given key.
- Parameters:
key (abc.Callable[[CT], T]) – Some callable accepting a single chain and returning a grouper value.
- Returns:
An iterator over pairs (group, chains), where chains is a chain list of chains that belong to group.
- Return type:
abc.Iterator[tuple[T, t.Self]]
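Example (illustrative sketch): grouping by a key callable yields `(group, members)` pairs. The function `group_by` is a hypothetical stand-in operating on id strings; gathering all members of a group regardless of their position in the input is an assumption about the intended behaviour.

```python
from collections import defaultdict


def group_by(items, key):
    """Yield (group, members) pairs, with members collected across the whole input."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    yield from groups.items()


ids = ["S|1-8", "X|1-3", "S|2-5"]
grouped = dict(group_by(ids, key=lambda i: i.split("|")[0]))
print(grouped)
```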
- index(value[, start[, stop]])[source]
Return the first index of value.
Raises ValueError if the value is not present.
Supporting start and stop arguments is optional, but recommended.
- Return type:
int
- iter_children()[source]
Simultaneously iterate over topological levels of children.
>>> from lXtractor.chain import ChainSequence
>>> s = ChainSequence.from_string('ABCDE', name='A')
>>> child1 = s.spawn_child(1, 4)
>>> child2 = child1.spawn_child(2, 3)
>>> x = ChainSequence.from_string('XXXX', name='X')
>>> child3 = x.spawn_child(1, 3)
>>> cl = ChainList([s, x])
>>> list(cl.iter_children())
[[A|1-4<-(A|1-5), X|1-3<-(X|1-4)], [A|2-3<-(A|1-4<-(A|1-5))]]
- Returns:
An iterator over chain lists of children levels.
- Return type:
Generator[ChainList[CT], None, None]
- iter_ids()[source]
Iterate over ids of this chain list.
- Returns:
An iterator over chain ids.
- Return type:
Iterator[str]
- iter_sequences()[source]
- Returns:
An iterator over ChainSequence’s.
- Return type:
abc.Generator[ChainSequence, None, None]
- iter_structure_sequences()[source]
- Returns:
An iterator over ChainStructure._seq attributes.
- Return type:
abc.Generator[ChainSequence, None, None]
- iter_structures()[source]
- Returns:
A generator over ChainStructure’s.
- Return type:
abc.Generator[ChainStructure, None, None]
- property categories: Set[str]
- Returns:
A set of categories inferred from meta of encompassed objects.
- property ids: list[str]
- Returns:
A list of ids for all chains in this list.
- property sequences: ChainList[ChainSequence]
- Returns:
All lXtractor.core.chain.Chain._seq or lXtractor.core.chain.sequence.ChainSequence objects within this chain list.
- property structure_sequences: ChainList[ChainSequence]
- property structures: ChainList[ChainStructure]
lXtractor.chain.io module
- class lXtractor.chain.io.ChainIO(num_proc=1, verbose=False, tolerate_failures=False)[source]
Bases:
object
A class handling reading/writing collections of Chain* objects.
- __init__(num_proc=1, verbose=False, tolerate_failures=False)[source]
- Parameters:
num_proc (int) – The number of parallel processes. Using more processes is especially beneficial for ChainStructure’s and Chain’s with structures. Otherwise, increasing this number may not reduce, and may even increase, the time needed to read/write objects.
verbose (bool) – Output logging and progress bar.
tolerate_failures (bool) – Errors when reading/writing do not raise an exception.
- read(obj_type, path, callbacks=(), **kwargs)[source]
Read obj_type-type objects from a path or an iterable of paths.
- Parameters:
obj_type (Type[CT]) – Some class with @classmethod(read(path)).
path (Path | Iterable[Path]) – Path to the dump to read from. It’s a path to a directory holding files necessary to init a given obj_type, or an iterable over such paths.
callbacks (Sequence[Callable[[CT], CT]]) – Callables applied sequentially to a parsed object.
kwargs – Passed to the object’s read() method.
- Returns:
A generator over initialized objects or futures.
- Return type:
Generator[CT | None, None, None]
- read_chain(path, **kwargs)[source]
Read Chain’s from the provided path.
If path contains signature files and directories (such as sequence.tsv and segments), it is assumed to contain a single object. Otherwise, it is assumed to contain multiple Chain objects.
- read_chain_seq(path, **kwargs)[source]
Read ChainSequence’s from the provided path.
If path contains signature files and directories (such as sequence.tsv and segments), it is assumed to contain a single object. Otherwise, it is assumed to contain multiple ChainSequence objects.
- Parameters:
path (Path | Iterable[Path]) – Path to a dump or a dir of dumps.
kwargs – Passed to read().
- Returns:
An iterator over ChainSequence objects.
- Return type:
Generator[ChainSequence | None, None, None]
- read_chain_str(path, **kwargs)[source]
Read ChainStructure’s from the provided path.
If path contains signature files and directories (such as structure.cif and segments), it is assumed to contain a single object. Otherwise, it is assumed to contain multiple ChainStructure objects.
- Parameters:
path (Path | Iterable[Path]) – Path to a dump or a dir of dumps.
kwargs – Passed to read().
- Returns:
An iterator over ChainStructure objects.
- Return type:
Generator[ChainStructure | None, None, None]
- write(chains, base, overwrite=False, **kwargs)[source]
- Parameters:
chains (CT | Iterable[CT]) – A single or multiple chains to write.
base (Path) – A writable dir. For multiple chains, will use base/chain.id directory.
overwrite (bool) – If the destination folder exists, False means returning the destination path without attempting to write the chain, whereas True results in an explicit .write() call.
kwargs – Passed to a chain’s write method.
- Returns:
Whatever write method returns.
- Return type:
Generator[Path | None | Future, None, None]
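Example (illustrative sketch): the `overwrite` rule described above can be expressed as a guard around the write call. The function `write_or_skip` and its `write` argument are hypothetical; the sketch leaves out the parallel execution and the futures mentioned in the return type.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def write_or_skip(dest: Path, write: Callable[[Path], None], overwrite: bool = False) -> Path:
    """Call `write(dest)` unless `dest` already exists and overwriting is disabled."""
    if dest.exists() and not overwrite:
        # Existing destination: return its path without attempting to write.
        return dest
    dest.mkdir(parents=True, exist_ok=True)
    write(dest)
    return dest


calls: List[Path] = []
with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / "chain_id"
    write_or_skip(dest, calls.append)                  # written: dest is new
    write_or_skip(dest, calls.append)                  # skipped: dest exists
    write_or_skip(dest, calls.append, overwrite=True)  # written again
print(len(calls))
```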
- num_proc
The number of parallel processes
- tolerate_failures
Errors when reading/writing do not raise an exception.
- verbose
Output logging and progress bar.
- class lXtractor.chain.io.ChainIOConfig(num_proc: 'int' = 1, verbose: 'bool' = False, tolerate_failures: 'bool' = False)[source]
Bases:
object
- __init__(num_proc=1, verbose=False, tolerate_failures=False)
- num_proc: int = 1
- tolerate_failures: bool = False
- verbose: bool = False
- lXtractor.chain.io.read_chains(paths, children, *, seq_cfg=ChainIOConfig(num_proc=1, verbose=False, tolerate_failures=False), str_cfg=ChainIOConfig(num_proc=1, verbose=False, tolerate_failures=False), seq_callbacks=(), str_callbacks=(), seq_kwargs=None, str_kwargs=None)[source]
Reads saved lXtractor.core.chain.chain.Chain objects without invoking lXtractor.core.chain.chain.Chain.read(). Instead, it will use separate ChainIO instances to read chain sequences and chain structures. The output is identical to ChainIO.read_chain_seq().
Consider using it for:
Parallel parsing of Chain objects with many structures.
Separate treatment of chain sequences and chain structures.
Better customization of chain sequences and structures parsing.
- Parameters:
paths (Path | Sequence[Path]) – A path or a sequence of paths to chains.
children (bool) – Search for, parse and integrate all nested children.
seq_cfg (ChainIOConfig) – ChainIO config for chain sequences parsing.
str_cfg (ChainIOConfig) – ChainIO config for chain structures parsing.
seq_callbacks (Sequence[Callable[[CT], CT]]) – A (potentially empty) sequence passed to the reader. Each callback must accept and return a single chain sequence.
str_callbacks (Sequence[Callable[[CT], CT]]) – Same for the structures.
seq_kwargs (dict[str, Any] | None) – Passed to lXtractor.core.chain.sequence.ChainSequence.read().
str_kwargs (dict[str, Any] | None) – Passed to lXtractor.core.chain.structure.ChainStructure.read().
- Returns:
A chain list of parsed chains.
- Return type:
lXtractor.chain.initializer module
A module encompassing the ChainInitializer used to init Chain*-type objects from various input types. It enables parallelization of reading structures and seq2seq mappings and is flexible thanks to callbacks.
- class lXtractor.chain.initializer.ChainInitializer(tolerate_failures=False, verbose=False)[source]
Bases:
object
In contrast to ChainIO, this object initializes new ChainSequence, ChainStructure, or Chain objects from various input types.
To initialize Chain objects, use from_mapping().
To initialize ChainSequence or ChainStructure objects, use from_iterable().
- __init__(tolerate_failures=False, verbose=False)[source]
- Parameters:
tolerate_failures (bool) – Don’t stop the execution if some object fails to initialize.
verbose (bool) – Output progress bars.
- from_iterable(it, num_proc=1, callbacks=None, desc='Initializing objects')[source]
Initialize ChainSequence’s and/or ChainStructure’s from a (possibly heterogeneous) iterable.
- Parameters:
it (abc.Iterable[ChainSequence | ChainStructure | Path | tuple[Path, abc.Sequence[str]] | tuple[str, str] | GenericStructure]) –
Supported elements are:
Initialized objects (passed without any actions).
Path to a sequence or a structure file.
(Path to a structure file, list of target chains).
A pair (header, _seq) to initialize a ChainSequence.
A GenericStructure with a single chain.
num_proc (int) – The number of processes to use.
callbacks (abc.Sequence[SingletonCallback] | None) – A sequence of callables accepting and returning an initialized object.
desc (str) – Progress bar description used if verbose is True.
- Returns:
A generator yielding initialized chain sequences and structures parsed from the inputs.
- Return type:
abc.Generator[_O | Future, None, None]
- from_mapping(m, key_callbacks=None, val_callbacks=None, item_callbacks=None, *, map_numberings=True, num_proc_read_seq=1, num_proc_read_str=1, num_proc_item_callbacks=1, num_proc_map_numbering=1, num_proc_add_structure=1, **kwargs)[source]
Initialize Chain’s from a mapping between sequences and structures.
It will first initialize objects to which the elements of m refer (see below) and then create maps between each sequence and associated structures, saving these into each structure’s ChainStructure._seq.
Note
key/value_callback are distributed to the parser and applied right after parsing the object. As a result, their application will be parallelized depending on the num_proc_read_seq and num_proc_read_str parameters.
- Parameters:
m (abc.Mapping[ChainSequence | Chain | tuple[str, str] | Path, abc.Sequence[ChainStructure | GenericStructure | bst.AtomArray | Path | tuple[Path, abc.Sequence[str]]]]) –
A mapping of the form {_seq => [structures]}, where _seq is one of:
Initialized ChainSequence.
A pair (header, _seq).
A path to a fasta file containing a single sequence.
While each structure is one of:
Initialized ChainStructure.
GenericStructure with a single chain.
biotite.AtomArray corresponding to a single chain.
A path to a structure file.
(A path to a structure file, list of target chains).
In the latter two cases, the chains will be expanded and associated with the same sequence.
key_callbacks (abc.Sequence[SingletonCallback] | None) – A sequence of callables accepting and returning a ChainSequence.
val_callbacks (abc.Sequence[SingletonCallback] | None) – A sequence of callables accepting and returning a ChainStructure.
item_callbacks (abc.Sequence[ItemCallback] | None) – A sequence of callables accepting and returning a parsed item: a tuple of Chain and a sequence of associated ChainStructure’s. Callbacks are applied sequentially to each item as a function composition in the supplied order (left to right). If the last callback returns None as a first element or an empty list as a second element, such an item will be filtered out. Item callbacks are applied after parsing sequences and structures and converting chain sequences to chains.
map_numberings (bool) – Map PDB numberings to canonical sequence’s numbering via pairwise sequence alignments.
num_proc_read_seq (int) – A number of processes to devote to sequence parsing. Typically, sequence reading doesn’t benefit from parallel processing, so it’s better to leave this default.
num_proc_read_str (int) – A number of processes dedicated to structures parsing.
num_proc_item_callbacks (int) – A number of CPUs to parallelize item callbacks’ application.
num_proc_map_numbering (int) – A number of processes to use for mapping between numbering of sequences and structures. Generally, this should be as high as possible for faster processing. In contrast to the other operations here, this one seems more CPU-bound and less resource hungry (although, keep in mind the size of the canonical sequence: if it’s too large, the RAM usage will likely explode). If None, will default to num_proc.
num_proc_add_structure (int) – In case of parallel numberings mapping, i.e., when num_proc_map_numbering > 1, this option allows transferring these numberings and adding structures to chains in parallel. It may be useful when add_to_children=True is passed in kwargs as it allows creating sub-structures in parallel.
kwargs – Passed to Chain.add_structure().
- Returns:
A list of initialized chains.
- Return type:
- property supported_seq_ext: list[str]
- Returns:
Supported sequence file extensions.
- property supported_str_ext: list[str]
- Returns:
Supported structure file extensions.
- class lXtractor.chain.initializer.ItemCallback(*args, **kwargs)[source]
Bases:
Protocol
A callback applied to processed items in ChainInitializer.from_mapping().
- __call__(inp)[source]
Call self as a function.
- Return type:
tuple[Chain | None, list[ChainStructure]]
- __init__(*args, **kwargs)
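Example (illustrative sketch): item callbacks in ChainInitializer.from_mapping() are described as a left-to-right function composition over `(chain, structures)` items, with an item dropped when the final result has `None` as the first element or an empty list as the second. The helper `apply_item_callbacks` and the two callbacks below are hypothetical, and plain strings stand in for Chain and ChainStructure objects.

```python
from functools import reduce


def apply_item_callbacks(item, callbacks):
    """Compose callbacks left to right; return None if the item is filtered out."""
    chain, structures = reduce(lambda acc, cb: cb(acc), callbacks, item)
    if chain is None or not structures:
        return None
    return chain, structures


def drop_short(item):
    # Keep only the stand-in "structures" with at least three residues.
    chain, structures = item
    return chain, [s for s in structures if len(s) >= 3]


def tag(item):
    # Modify the stand-in "chain" and pass the structures through.
    chain, structures = item
    return chain.upper(), structures


kept = apply_item_callbacks(("p1", ["AAAA", "AA"]), [drop_short, tag])
dropped = apply_item_callbacks(("p2", ["A"]), [drop_short, tag])
print(kept, dropped)
```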
- class lXtractor.chain.initializer.SingletonCallback(*args, **kwargs)[source]
Bases:
Protocol
A protocol defining the signature for a callback used with ChainInitializer on single objects right after parsing.
- __call__(inp: CT) -> CT | None[source]
- __call__(inp: list[ChainStructure]) -> list[ChainStructure] | None
- __call__(inp: None) -> None
Call self as a function.
- __init__(*args, **kwargs)
lXtractor.chain.tree module
A module to handle the ancestral tree of the Chain*-type objects defined by their parent/children attributes and/or meta info.
- lXtractor.chain.tree.list_ancestors(c)[source]
>>> o = ChainSequence.from_string('x' * 5, 1, 5, 'C')
>>> c13 = o.spawn_child(1, 3)
>>> c12 = c13.spawn_child(1, 2)
>>> list_ancestors(c12)
[C|1-3<-(C|1-5), C|1-5]
- Parameters:
c (Chain | ChainSequence | ChainStructure) – Chain*-type object.
- Returns:
A list of ancestor objects obtained from the parent attribute.
- Return type:
list[Chain | ChainSequence | ChainStructure]
- lXtractor.chain.tree.list_ancestors_names(id_or_chain)[source]
>>> list_ancestors_names('C|1-5<-(C|1-3<-(C|1-2))')
['C|1-3', 'C|1-2']
- Parameters:
id_or_chain (Chain | ChainSequence | ChainStructure | str) – Chain*-type object or its id.
- Returns:
A list of the parents’ ‘{name}|{start}-{end}’ representations parsed from the object’s id.
- Return type:
list[str]
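Example (illustrative sketch): an id of the form `name<-(parent<-(grandparent))` can be split on the `<-(` separator to obtain ancestor names. The function `ancestor_names` is hypothetical and is not the library’s parser; it assumes that names do not end with a closing parenthesis.

```python
def ancestor_names(obj_id: str) -> list:
    """Split a nested id into the names of its ancestors, nearest first."""
    parts = obj_id.split("<-(")
    # parts[0] is the object itself; trailing ")" only closes the nesting.
    return [p.rstrip(")") for p in parts[1:]]


parsed = ancestor_names("C|1-5<-(C|1-3<-(C|1-2))")
print(parsed)
```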
- lXtractor.chain.tree.make(chains, connect=False, objects=False, check_is_tree=True)[source]
Make an ancestral tree – a directed graph representing ancestral relationships between chains.
- Parameters:
chains (Iterable[Chain | ChainSequence | ChainStructure]) – An iterable of Chain*-type objects.
connect (bool) – Connect actual objects by populating .children and .parent attributes.
objects (bool) – Create an object tree using make_obj_tree(). Otherwise, create a “string” tree using make_str_tree(). Check the docs of these functions to understand the differences.
check_is_tree (bool) – If True, check if the obtained graph is actually a tree. If it’s not, raise ValueError.
- Returns:
- Return type:
DiGraph
- lXtractor.chain.tree.make_filled(name, _t)[source]
Make a “filled” version of an object to occupy the tree.
- Parameters:
name (str) – Name of the node obtained via node_name().
_t (CT | Type[CT]) – Some Chain*-type object.
- Returns:
An object with filled sequence. If it’s a ChainStructure object, it will have an empty structure.
- Return type:
CT
- lXtractor.chain.tree.make_obj_tree(chains, connect=False, check_is_tree=True)[source]
Make an ancestral tree, a directed graph representing ancestral relationships between chains. The nodes of the tree are Chain*-type objects. Hence, they must be hashable. This restricts types of sequences valid for ChainSequence to abc.Sequence[abc.Hashable].
As a useful side effect, this function can aid in filling the gaps in the actual tree indicated by the id-relationship suggested by the “id” field of the meta property. In other words, if a segment S|1-2 was obtained by spawning from S|1-5, S|1-2’s id will reflect this:
>>> s = make_filled('S|1-5', ChainSequence.make_empty())
>>> c12 = s.spawn_child(1, 2)
>>> c12
S|1-2<-(S|1-5)
However, if S|1-5 was lost (e.g., by writing/reading S|1-2 to/from disk), and S|1-2.parent is None, we can use the ID stored in meta to recover ancestral relationships. This function will attend to such cases and create a filler object S|1-5 with a “*”-filled sequence.
>>> c12.parent = None
>>> c12
S|1-2
>>> c12.meta['id']
'S|1-2<-(S|1-5)'
>>> ct = make_obj_tree([c12], connect=True)
>>> assert len(ct.nodes) == 2
>>> [n.id for n in ct.nodes]
['S|1-2<-(S|1-5)', 'S|1-5']
- Parameters:
chains (Iterable[CT]) – A homogeneous iterable of Chain*-type objects.
connect (bool) – If True, connect both supplied and created filler objects via children and parent attributes.
check_is_tree (bool) – If True, check if the obtained graph is actually a tree. If it’s not, raise ValueError.
- Returns:
A networkx’s directed graph with Chain*-type objects as nodes.
- Return type:
DiGraph
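Example (illustrative sketch): recovering lost ancestors from ids boils down to reading the lineage encoded in each id and adding a node for every ancestor that is not among the inputs. The function `build_parent_map` is hypothetical; it uses short `name|start-end` strings as nodes and a plain dictionary instead of a networkx DiGraph and filler objects.

```python
from typing import Dict, List, Optional


def build_parent_map(ids: List[str]) -> Dict[str, Optional[str]]:
    """Map every node name to its parent's name (None for roots)."""
    parents: Dict[str, Optional[str]] = {}
    for obj_id in ids:
        # "S|1-2<-(S|1-5)" -> ["S|1-2", "S|1-5"]
        lineage = [p.rstrip(")") for p in obj_id.split("<-(")]
        for child, parent in zip(lineage, lineage[1:]):
            parents[child] = parent
        # The last name in the lineage is a root unless a parent is known.
        parents.setdefault(lineage[-1], None)
    return parents


recovered = build_parent_map(["S|1-2<-(S|1-5)"])
print(recovered)
```

Here `S|1-5` appears in the result although only `S|1-2` was supplied, which mirrors the filler node in the doctest above.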
- lXtractor.chain.tree.make_str_tree(chains, connect=False, check_is_tree=True)[source]
A computationally cheaper alternative to make_obj_tree(), where nodes are string objects, while actual objects reside in a node attribute “objs”. It allows for a faster tree construction since it avoids expensive hashing of Chain*-type objects.
- Parameters:
chains (Iterable[Chain | ChainSequence | ChainStructure]) – An iterable of Chain*-type objects.
connect (bool) – If True, connect both supplied and created filler objects via children and parent attributes.
check_is_tree (bool) – If True, check if the obtained graph is actually a tree. If it’s not, raise ValueError.
- Returns:
A networkx’s directed graph.
- Return type:
DiGraph
- lXtractor.chain.tree.recover(c)[source]
Recover ancestral relationships of a Chain*-type object. This will use make_str_tree() to recover ancestors from object IDs of an object itself and any encompassed children.
Note
It may be used as a callback in lXtractor.chain.io.ChainIO.read().
Note
make_str_tree() creates “filled” parents via make_filled().
- Parameters:
c (Chain | ChainSequence | ChainStructure) – A Chain*-type object.
- Returns:
The same object with populated children and parent attributes.
- Return type: