Database
- class pyskani.Database(path=None, *, compression=125, marker_compression=1000, k=15, format=None)
A database storing sketched genomes.
The database contains two sketch collections with different compression levels: marker sketches, which are heavily compressed and always kept in memory, and genome sketches, which take more memory but may be stored in an external file.
- flush()
Flush the database buffers to disk.
This does nothing for a database loaded in memory. For a database stored in a folder, this saves the markers into a file named markers.bin.
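As a rough illustration of where flush fits, the sketch below adds genomes to a folder-backed database and flushes once at the end so that the markers are written out. The helper name sketch_and_flush and the genomes mapping are hypothetical, and the code assumes that passing a folder as the path argument of the constructor yields a folder-backed database. pyskani is imported lazily, only when no database object is supplied.

```python
def sketch_and_flush(genomes, folder, database=None):
    """Sketch every genome into a folder-backed database, then flush it.

    `genomes` maps a genome name to a list of contig sequences (bytes).
    `database` may be any object exposing `sketch` and `flush`; when it is
    omitted, a `pyskani.Database` is created for `folder` (assumption: a
    non-None `path` makes the database folder-backed).
    """
    if database is None:
        import pyskani  # lazy import: only needed for the real database
        database = pyskani.Database(folder)
    for name, contigs in genomes.items():
        # Each contig is passed as a separate positional argument.
        database.sketch(name, *contigs)
    # Flush once after all genomes have been added.
    database.flush()
    return database
```

Flushing once after the loop, rather than after every genome, is a design choice of this sketch and not a requirement stated by the library.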
- load()
Load a database from a folder containing sketches.
The sketches will be loaded in memory to speed up querying. To reduce memory consumption and load sketches lazily from the folder, use Database.open.
- Parameters:
path (str, bytes, or os.PathLike) – The path to the folder containing the sketched references.
- Returns:
Database – A database with all sketches loaded in memory.
- Raises:
OSError – When the files from the folder could not be opened.
ValueError – When the sketches could not be deserialized.
- open()
Open a database from a folder containing sketches.
The marker sketches will be loaded in memory, but the genome sketches will be loaded only when needed during querying. To speed up querying by pre-fetching sketches, use Database.load.
- Parameters:
path (str, bytes, or os.PathLike) – The path to the folder containing the sketched references.
- Returns:
Database – A database with only markers loaded in memory.
- Raises:
OSError – When the files from the folder could not be opened.
ValueError – When the markers could not be deserialized.
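The trade-off between Database.load and Database.open can be captured in a small helper. This is a minimal sketch, not part of pyskani: the name open_references and its lazy and database_cls parameters are hypothetical, and it assumes both methods can be called on the class with a single path argument, as the parameter lists above suggest.

```python
def open_references(path, lazy=False, database_cls=None):
    """Return a database for the sketches stored in the folder `path`.

    lazy=False uses `load`: every sketch is read into memory up front.
    lazy=True uses `open`: only markers are read, and genome sketches are
    fetched from the folder when a query needs them.
    """
    if database_cls is None:
        import pyskani  # lazy import so a stand-in class can be used instead
        database_cls = pyskani.Database
    loader = database_cls.open if lazy else database_cls.load
    return loader(path)
```

Both methods are documented as raising OSError or ValueError, so a caller of this helper may want to handle those two exceptions around the call.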
- query(name, *contigs, seed=True, learned_ani=None, median=False, robust=False, cutoff=None, faster_small=False)
Query the database with a genome.
- Parameters:
name (str) – The name of the query genome.
contigs (str, bytes, bytearray or memoryview) – The contigs of the query genome.
- Keyword Arguments:
seed (bool) – Compute seed positions while sketching the query.
learned_ani (bool or None) – Use a regression model to compute ANI, using a model trained on MAGs. Pass True or False to force enabling or disabling the model, respectively. By default, the regression model is enabled when the sketch compression factor is >=70 and not running in median mode.
median (bool) – Estimate median identity instead of average identity. Disabled by default. Equivalent to the --median flag of the CLI.
robust (bool) – Estimate the mean after trimming off the 10%/90% quantiles. Disabled by default. Equivalent to the --robust flag of the CLI.
cutoff (float or None) – The cutoff used to screen out pairs with approximately lower identity, as computed with k-mer sketching. Defaults to 0.8 for ANI and 0.6 for AAI. Equivalent to the -s flag of the CLI.
faster_small (bool) – Set to True to filter genomes with fewer than 20 marker k-mers more aggressively. Disabled by default. Equivalent to the --faster-small flag of the CLI.
- Returns:
- .. versionadded:: 0.2.0
The cutoff and faster_small keyword arguments.
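To show how the positional contigs and the keyword arguments of query combine, here is a minimal sketch that runs several query genomes against one database. The helper query_genomes and the genomes mapping are hypothetical. The results are returned exactly as query produces them, without any assumption about their type.

```python
def query_genomes(database, genomes, **options):
    """Query `database` with each genome and collect the raw results.

    `genomes` maps a query name to a list of contigs (str, bytes, bytearray
    or memoryview). `options` is forwarded unchanged to `query`, for
    example median=True or cutoff=0.9.
    """
    results = {}
    for name, contigs in genomes.items():
        # Contigs are unpacked into positional arguments after the name.
        results[name] = database.query(name, *contigs, **options)
    return results
```

Given an existing database object db, a call such as query_genomes(db, {"sample": [contig_a, contig_b]}, median=True) would query one two-contig genome in median mode.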
- save(path, overwrite=False, format=None)
Save the database to the given path.
- sketch(name, *contigs, seed=True)
Add a reference genome to the database.
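Since sketch takes the contigs of a genome as separate positional arguments, a multi-contig reference has to be split before the call. The sketch below does this for FASTA-formatted text. The functions read_contigs and sketch_fasta are hypothetical helpers, the parser is deliberately minimal, and database stands for any object with a compatible sketch method, such as a pyskani.Database instance.

```python
def read_contigs(fasta_text):
    """Split FASTA-formatted text into a list of contig sequences (bytes)."""
    contigs, current = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line closes the contig collected so far, if any.
            if current:
                contigs.append("".join(current).encode("ascii"))
                current = []
        else:
            current.append(line)
    if current:
        contigs.append("".join(current).encode("ascii"))
    return contigs


def sketch_fasta(database, name, fasta_text):
    """Add one reference genome to `database`, one argument per contig."""
    contigs = read_contigs(fasta_text)
    database.sketch(name, *contigs)
    return len(contigs)
```

Headers are discarded here because only the genome name is passed to sketch. A dedicated FASTA parser would be the better choice for real input files.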
- path
The path where sketches are stored.
- Type:
pathlib.Path or None