Database

class pyskani.Database(path=None, *, compression=125, marker_compression=1000, k=15)

A database storing sketched genomes.

The database contains two sketch collections with different compression levels: marker sketches, which are heavily compressed and always kept in memory, and genome sketches, which take more memory but may be stored in an external file.
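
As a rough illustration of the constructor signature above, the helper below creates either an in-memory database or one backed by a folder. The helper name `make_database` and its `folder` argument are hypothetical; only the keyword arguments listed in the signature are used, and `pyskani` is imported lazily so that the function can be defined without the package installed.

```python
def make_database(folder=None, *, compression=125, marker_compression=1000, k=15):
    """Create a pyskani.Database; it lives in memory when `folder` is None."""
    import pyskani  # imported lazily; requires the pyskani package

    # `compression` applies to the genome sketches, `marker_compression`
    # to the (more heavily compressed) marker sketches.
    return pyskani.Database(
        folder,
        compression=compression,
        marker_compression=marker_compression,
        k=k,
    )
```

Calling `make_database()` keeps everything in memory, while `make_database("refs")` passes a path (here a made-up folder name) so that genome sketches may be stored outside of memory.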

flush()

Flush the database.

This does nothing for a database loaded in memory. For a database stored in a folder, this will save the markers into a file named markers.bin.
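
A minimal sketch of when a flush might be issued, assuming a batch of genomes is added first and the markers are written once at the end. The helper name `sketch_and_flush` and the `genomes` mapping (name to list of contigs) are illustrative only.

```python
def sketch_and_flush(db, genomes):
    """Sketch every genome into `db`, then flush once at the end."""
    for name, contigs in genomes.items():
        db.sketch(name, *contigs)
    # No-op for an in-memory database; for a folder-backed database this
    # saves the marker sketches to `markers.bin` inside the folder.
    db.flush()
    # `path` is documented as pathlib.Path or None.
    return db.path
```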

load()

Load a database from a folder containing sketches.

The sketches will be loaded in memory to speed up querying. To reduce memory consumption and load sketches lazily from the folder, use Database.open.

Parameters:

path (str, bytes, or os.PathLike) – The path to the folder containing the sketched references.

Returns:

Database – A database with all sketches loaded in memory.

Raises:
  • OSError – When the files from the folder could not be opened.

  • ValueError – When the sketches could not be deserialized.
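
The two documented exceptions can be handled separately, for example as below. This is only a sketch: the helper name `load_database` is hypothetical, `Database.load` is assumed to be callable on the class as the description suggests, and returning `None` on failure is a design choice made here for brevity.

```python
import sys


def load_database(folder):
    """Load every sketch from `folder` into memory, or return None on failure."""
    import pyskani  # imported lazily; requires the pyskani package

    try:
        return pyskani.Database.load(folder)
    except OSError as err:
        # Raised when the files from the folder could not be opened.
        print(f"could not open files in {folder!r}: {err}", file=sys.stderr)
    except ValueError as err:
        # Raised when the sketches could not be deserialized.
        print(f"could not deserialize sketches in {folder!r}: {err}", file=sys.stderr)
    return None
```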

open()

Open a database from a folder containing sketches.

The marker sketches will be loaded in memory, but the genome sketches will be loaded only when needed during querying. To speed up querying by pre-fetching sketches, use Database.load.

Parameters:

path (str, bytes, or os.PathLike) – The path to the folder containing the sketched references.

Returns:

Database – A database with only markers loaded in memory.

Raises:
  • OSError – When the files from the folder could not be opened.

  • ValueError – When the markers could not be deserialized.
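
The trade-off between the two entry points could be wrapped as follows. The helper name `get_database` and its `low_memory` flag are hypothetical, and both methods are assumed to be callable on the class.

```python
def get_database(folder, *, low_memory=True):
    """Pick between lazy opening and eager loading of a sketch folder."""
    import pyskani  # imported lazily; requires the pyskani package

    if low_memory:
        # Only the markers are held in memory; genome sketches are read
        # from the folder when a query needs them.
        return pyskani.Database.open(folder)
    # Everything is pre-fetched into memory for faster querying.
    return pyskani.Database.load(folder)
```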

query(name, *contigs, seed=True, learned_ani=None, median=False, robust=False)

Query the database with a genome.

Parameters:
Keyword Arguments:
  • seed (bool) – Compute seed positions while sketching the query.

  • learned_ani (bool or None) – Use a regression model to compute ANI, using a model trained on MAGs. Pass True or False to force enabling or disabling the model, respectively. By default, the regression model is enabled when the sketch compression factor is >=70.

  • median (bool) – Estimate median identity instead of average identity. Disabled by default.

  • robust (bool) – Estimate the mean after trimming off the values below the 10% quantile and above the 90% quantile. Disabled by default.
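
To make the options above more concrete, here is a pure-Python sketch of the statistics they refer to. It is a conceptual illustration only and is not claimed to match the internal implementation; the function names and the sample identity values are invented for the example.

```python
import statistics


def resolve_learned_ani(learned_ani, compression):
    """Documented default: None enables the model when compression >= 70."""
    if learned_ani is None:
        return compression >= 70
    return bool(learned_ani)


def trimmed_mean(values, lower_pct=10, upper_pct=90):
    """Mean of the sorted values between the lower and upper percent cut-offs."""
    ordered = sorted(values)
    start = len(ordered) * lower_pct // 100
    # Always keep at least one value so the mean is defined.
    stop = max(start + 1, len(ordered) * upper_pct // 100)
    return statistics.fmean(ordered[start:stop])


# Hypothetical per-region identities with one low and one high outlier.
identities = [50, 90, 91, 92, 93, 94, 95, 96, 97, 100]
average = statistics.fmean(identities)     # what the default reports
median = statistics.median(identities)     # what `median=True` refers to
robust = trimmed_mean(identities)          # what `robust=True` refers to
```

With these sample values the plain average is pulled down by the outlier at 50, whereas the median and the trimmed mean are not.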

Returns:

list of Hit – The hits found for the query.
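
A possible way to use the returned list is sketched below. Note the assumptions: the helper name `best_hit` is hypothetical, and `identity` is assumed to be an attribute of `Hit`, which is not documented in this section.

```python
def best_hit(db, name, contigs, **options):
    """Query `db` with a genome and return the highest-identity hit, or None."""
    # `options` is forwarded as keyword arguments, e.g. median=True.
    hits = db.query(name, *contigs, **options)
    # ASSUMPTION: each Hit exposes an `identity` attribute.
    return max(hits, key=lambda hit: hit.identity, default=None)
```

The contigs are unpacked because `query` takes them as positional arguments after the name.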

save(path, overwrite=False)

Save the database to the given path.
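
Since the behaviour of `overwrite=False` on an existing path is not described here, a cautious wrapper might check the target itself before saving. This is a sketch under assumptions: the helper name `save_database` is hypothetical, and `save` is assumed to accept a `pathlib.Path` like the other path arguments in this class.

```python
from pathlib import Path


def save_database(db, folder, *, replace=False):
    """Save `db` to `folder`, refusing to touch an existing path unless asked."""
    target = Path(folder)
    if target.exists() and not replace:
        raise FileExistsError(f"{target} already exists; pass replace=True")
    # ASSUMPTION: `save` accepts a pathlib.Path for its `path` argument.
    db.save(target, overwrite=replace)
    return target
```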

sketch(name, *contigs, seed=True)

Add a reference genome to the database.

This method is a shortcut for Database.add_draft when a genome is complete (i.e. only contains a single contig).

Parameters:
Keyword Arguments:

seed (bool) – Compute seed positions while sketching the reference genome.
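
As an example of feeding a multi-contig genome to `sketch`, the snippet below reads a FASTA file with a small hand-written parser and passes each record as one contig. The helper names are hypothetical, and passing contigs as `bytes` objects is an assumption, since the accepted contig types are not listed in this section.

```python
def read_fasta_contigs(path):
    """Return the sequences of a FASTA file as a list of bytes, one per record."""
    contigs, chunks = [], []
    with open(path, "rb") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(b">"):
                # A new header closes the previous record, if any.
                if chunks:
                    contigs.append(b"".join(chunks))
                    chunks = []
            elif line:
                chunks.append(line)
    if chunks:
        contigs.append(b"".join(chunks))
    return contigs


def add_reference(db, name, fasta_path, seed=True):
    """Sketch all contigs of one FASTA file as a single reference genome."""
    contigs = read_fasta_contigs(fasta_path)
    # ASSUMPTION: contigs may be given as bytes objects.
    db.sketch(name, *contigs, seed=seed)
    return len(contigs)
```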

path

The path where sketches are stored.

Type:

pathlib.Path or None