
Add more documentation.

Start properly documenting the project in a Sphinx-based manual.
Damien Goutte-Gattat 5 months ago


@ -0,0 +1,27 @@
# -*- coding: utf-8 -*-
# -- Project information -----------------------------------------------------
project = 'Incenp.Bioutils'
copyright = '2021 Damien Goutte-Gattat'
author = 'Damien Goutte-Gattat <>'
# -- General configuration ---------------------------------------------------
source_suffix = '.rst'
master_doc = 'index'
language = 'en'
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
pygments_style = 'sphinx'
extensions = [
# -- Options for HTML output -------------------------------------------------
html_theme = 'sphinx_rtd_theme'
html_static_path = ['_static']


@ -0,0 +1,34 @@
.. Incenp.Bioutils documentation master file, created by
sphinx-quickstart on Tue Jan 5 22:29:38 2021.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Incenp.Bioutils Documentation
Incenp.Bioutils is a set of command line utilities and helper Python
modules for computational biology. It is built on top of `Biopython`_.
.. _Biopython:
Incenp.Bioutils is free software, published under the GNU General Public
License. It is written in Python and should work with Python 3.6+.
.. toctree::
:maxdepth: 2
:caption: Contents:
Indices and tables
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`


@ -0,0 +1,71 @@
Installing Incenp.Bioutils
Installing from PyPI
Packages for Incenp.Bioutils are published on the
`Python Package Index`_ under the name ``incenp.bioutils``. To install
the latest version from PyPI:
.. _Python Package Index:
.. code-block:: console
$ pip install -U incenp.bioutils
Installing from source
You may download a release tarball from the `homepage`_ or from the
`release page`_, and then proceed to a manual installation:
.. _homepage:
.. _release page:
.. code-block:: console
$ tar zxf incenp.bioutils-|version|.tar.gz
$ cd incenp.bioutils-|version|
$ python build
$ python install
You may also clone the repository:
.. code-block:: console
$ git clone
and then proceed as above.
Incenp.Bioutils requires the following Python dependencies to work:
* `Biopython <>`_
* `Click <>`_
* `Click-Shell <>`_
If you install Incenp.Bioutils from the Python Package Index with `pip`
as described above, those dependencies should be automatically installed
if they are not already available on your system.
Testing the installation
You can check whether Incenp.Bioutils has been installed correctly by
trying to invoke one of the command-line utilities it provides:
.. code-block:: console
$ seqtool --version
seqtool |version|
Copyright © 2020 Damien Goutte-Gattat
This program is released under the GNU General Public License.
See the COPYING file or <>


@ -0,0 +1,81 @@
The seqtool command
The ``seqtool`` command provides subcommands to perform various
operations on sequence files.
All subcommands operate on sequences specified as :doc:`Uniform Sequence
Addresses <usa>`.
The cat subcommand
The ``cat`` subcommand is analogous to the Unix command of the same name.
It reads sequence files and writes out another sequence file. It can be
used for several purposes:
* converting a sequence file from one format to another;
* removing spurious annotations left in a sequence file by some
molecular biology programs;
* catenating several sequences into a single sequence (keeping all the
associated sequence features).
Converting a Genbank file into a FASTA file::
$ seqtool cat -o fasta::file.fasta
Creating a new sequence by catenating several input sequences together,
and setting some annotations in the resulting sequence::
$ seqtool -o \
--name "NEWSEQ" \
--description "This is my catenated sequence" \
--division SYN \
fasta::left.fasta abi::middle.ab1
The siresist subcommand
The ``siresist`` subcommand takes a DNA sequence as input and generates
a new DNA sequence with the same translation but using different codons.
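The idea of a "same translation, different codons" rewrite can be illustrated with a minimal sketch. This is not the algorithm used by ``siresist``; it assumes the standard genetic code, an input whose length is a multiple of three, and a naive rule (pick the alphabetically first synonymous codon). The names ``translate`` and ``recode`` are hypothetical.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string whose length is a multiple of three."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def recode(dna):
    """Replace each codon with a different synonymous codon when one exists.

    Hypothetical helper: the selection rule (first synonym in alphabetical
    order) is an assumption made for illustration only.
    """
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    recoded = []
    for i in range(0, len(dna), 3):
        codon = dna[i:i + 3]
        synonyms = sorted(
            other
            for other, aa in CODON_TABLE.items()
            if aa == CODON_TABLE[codon] and other != codon
        )
        # Codons without synonyms (ATG, TGG) are kept unchanged.
        recoded.append(synonyms[0] if synonyms else codon)
    return "".join(recoded)


original = "ATGGCTTGGAAA"
silent = recode(original)
print(original, "->", silent, "| protein:", translate(silent))
```

Codons for methionine and tryptophan have no synonyms, so a sequence made only of those codons would come out unchanged under this rule.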
The gateway subcommand
The ``gateway`` subcommand performs *in silico* Gateway® reactions. It
automatically detects the appropriate *attB/P/L/R* sites within the two
input sequences and generates an output sequence with all the features
appropriately copied over.
The plasmm subcommand
Given the annotated sequence of a plasmid, the ``plasmm`` subcommand
generates a PDF file describing that plasmid, with a map and a list of the
main sequence features.
The blast and dotter subcommands
These are wrappers for the different BLAST commands (``blastn``,
``blastp``, ``tblastn``, and so on) from the `NCBI BLAST package`_ and
the ``dotter`` command from the `SeqTools package`_.
.. _NCBI BLAST package:
.. _SeqTools package:
They do not expose all the options of the original programs. Their main
advantage is that they accept sequences in any format supported by
Biopython’s ``SeqIO`` module, whereas the original programs only read
files in the FASTA format.


@ -0,0 +1,97 @@
The seqvault command
The ``seqvault`` command provides a command-line interface to `BioSQL`_
.. _BioSQL:
It is intended to be used with a slightly modified version of the BioSQL
database schema (provided with ``Incenp.Bioutils`` source code in the
``biosql`` directory), where every `biodatabase` is associated with a
three-letter prefix. That prefix is then used to automatically assign
accession numbers (of the form ``PRE_xxxxxx``, where ``PRE`` is the
prefix) when importing sequences into the database. However, ``seqvault``
can also be used with pristine BioSQL databases.
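As an illustration of the ``PRE_xxxxxx`` pattern, here is a minimal sketch of how such an accession number could be built. It assumes that the prefix is three uppercase letters and that the ``xxxxxx`` part is a six-digit, zero-padded number; ``format_accession`` is a hypothetical helper, not part of the package.

```python
import re


def format_accession(prefix, number):
    """Build an accession number from a three-letter prefix and an integer.

    Assumptions for this sketch: uppercase prefix, six zero-padded digits.
    """
    if not re.fullmatch(r"[A-Z]{3}", prefix):
        raise ValueError("prefix must be exactly three uppercase letters")
    if not 0 <= number <= 999999:
        raise ValueError("number must fit in six digits")
    return f"{prefix}_{number:06d}"


print(format_accession("PLM", 42))
```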
Setting up the BioSQL database
If you don’t already have a BioSQL database (or access to one), follow
these instructions to set one up.
With PostgreSQL
Create a new PostgreSQL user account and a new database::
# createuser <username>
# createdb -O <username> <dbname>
Initialize the newly created database by running the provided
``biosql/biosqldb-pg.sql`` script::
# psql -h localhost -U <username> <dbname> < biosql/biosqldb-pg.sql
With SQLite
Create and initialize the database with the following command::
$ sqlite3 mydb.sqlite < biosql/biosqldb-sqlite.sql
Configuring seqvault
Create an INI-style configuration file named ``databases.ini`` in the
``$XDG_CONFIG_HOME/bioutils`` directory, describing the BioSQL server(s)
to use with ``seqvault``. For example, to access the two databases
created above, use the following file::
type: biosql
driver: psycopg2
host: localhost
user: <username>
database: <dbname>
type: biosql
driver: sqlite3
database: <path/to/mydb.sqlite>
If the ``username`` user account on the PostgreSQL server is password-protected,
add a ``password`` option in the corresponding section.
The ``seqvault`` program will by default connect to the first server
described in the configuration file. Use the ``-s`` option to choose
another section from the configuration file.
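The selection rule described above (first section by default, a named section on request) can be sketched with the standard ``configparser`` module. This is not the code used by ``seqvault``: the section names ``pgserver`` and ``litedb``, the sample values, and the fallback to ``~/.config`` when ``XDG_CONFIG_HOME`` is unset are all assumptions made for the example.

```python
import configparser
import os
from pathlib import Path


def config_path():
    """Return the expected location of databases.ini."""
    # Assumption: fall back to ~/.config when XDG_CONFIG_HOME is unset.
    base = os.environ.get("XDG_CONFIG_HOME") or str(Path.home() / ".config")
    return Path(base) / "bioutils" / "databases.ini"


def select_section(parser, name=None):
    """Pick the named section, or the first one when no name is given."""
    sections = parser.sections()
    if not sections:
        raise ValueError("no database section in the configuration")
    if name is None:
        return sections[0]
    if name not in sections:
        raise KeyError(name)
    return name


# Hypothetical file contents; section names are illustrative only.
SAMPLE = """\
[pgserver]
type: biosql
driver: psycopg2
host: localhost
user: alice
database: sequences

[litedb]
type: biosql
driver: sqlite3
database: /tmp/mydb.sqlite
"""

parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE)  # a real program would read config_path() instead

default = select_section(parser)
print(default, dict(parser[default]))
```

``configparser`` accepts both ``key: value`` and ``key = value`` lines and keeps sections in file order, which is what makes "first section" well defined here.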
Using seqvault
The following examples show some typical uses of ``seqvault``.
Creating a new BioSQL subdatabase named *plasmids* with the prefix
*PLM*::
$ seqvault newdb -p PLM plasmids
Importing a sequence from a file into the subdatabase::
$ seqvault add plasmids
Listing all sequences in a subdatabase::
$ seqvault list plasmids
Extracting a sequence from a subdatabase::
$ seqvault get PLM_123456


@ -0,0 +1,22 @@ package
---------- module
.. automodule::
Module contents
.. automodule::


@ -0,0 +1,19 @@ package
.. toctree::
:maxdepth: 4
Module contents
.. automodule::


@ -0,0 +1,78 @@ package
---------- module
.. automodule::
:show-inheritance: module
.. automodule::
:show-inheritance: module
.. automodule::
:show-inheritance: module
.. automodule::
:show-inheritance: module
.. automodule::
:members: USA, parse_usa, read_usa, write_usa
:show-inheritance: module
.. automodule::
:show-inheritance: module
.. automodule::
:show-inheritance: module
.. automodule::
Module contents
.. automodule::


@ -0,0 +1,7 @@
API Documentation
.. toctree::
:maxdepth: 4


@ -0,0 +1,178 @@
Uniform Sequence Addresses
The ``Incenp.Bioutils`` package supports the `Uniform Sequence Address`_
scheme designed and used by the `EMBOSS package`_. All command-line
tools can read and write sequences from and to a location specified by
such addresses.
.. _Uniform Sequence Address:
.. _EMBOSS package:
Principle and examples
Briefly, a *Uniform Sequence Address* or *USA* is a unified way to
specify the location and optionally the format of a biological sequence.
Please see the EMBOSS document referred to above for a complete
description of USAs, including a formal specification of their syntax.
Here are some examples of USAs:
Get all sequences in the file ````, expected to be in the
*Genbank* format.
Get the segment 20..100 from all sequences in the *FASTA* file
Same as the previous example, but reverse-complement the segments.
Get the sequence named ``SEQ1`` from the *Genbank* file ````.
Get the sequence named ``SEQ`` from the database ``mydb``.
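To make the structure of such addresses more concrete, the following sketch splits a USA into an optional format, a location, and an optional range. It is a simplified illustration, not the package's ``parse_usa`` function and not a full implementation of the EMBOSS syntax: it only handles the ``format::location[start:end:r]`` shape with positive positions, and it does not try to tell a ``file:entry`` location from a ``database:entry`` one. The file names in the usage lines are hypothetical.

```python
import re

# Simplified pattern: optional "format::" prefix, a location, and an
# optional "[start:end]" or "[start:end:r]" suffix.
USA_PATTERN = re.compile(
    r"^(?:(?P<format>[A-Za-z0-9]+)::)?"
    r"(?P<location>[^\[\]]+?)"
    r"(?:\[(?P<start>\d+):(?P<end>\d+)(?P<reverse>:r)?\])?$"
)


def split_usa(usa):
    """Split a USA string into its main components (illustrative only)."""
    match = USA_PATTERN.match(usa)
    if match is None:
        raise ValueError(f"unsupported address: {usa!r}")
    start, end = match["start"], match["end"]
    return {
        "format": match["format"],
        "location": match["location"],
        "start": int(start) if start is not None else None,
        "end": int(end) if end is not None else None,
        "reverse": match["reverse"] is not None,
    }


for address in ("fasta::file.fasta[20:100:r]", "mydb:SEQ"):
    print(address, "->", split_usa(address))
```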
Configuration of databases
To fetch sequences from biological databases as in the last example
above, the databases to use must first be described in a configuration
file located in ``$XDG_CONFIG_HOME/bioutils/databases.ini``.
Each section in this INI-style file describes a database. The database
identifier in a USA must match the name of one of the sections in the
file. For example, the last USA above assumes the ``databases.ini`` file
contains a section named *mydb*.
Within a section, the ``type`` parameter indicates the type of database.
Supported database types are:
A SQL database using the BioSQL schema, as supported by Biopython.
The ExPASy server.
One of the NCBI Entrez databases.
BioSQL databases
A section describing a BioSQL database must contain at least the
following parameters:
Indicates the Python SQL driver (dependent on the underlying SQL
database server; for example, ``psycopg2`` for a PostgreSQL server,
or ``sqlite3`` for a SQLite database).
The name of the database. For a SQLite database, this is the path
to the database file.
For non-SQLite servers, other parameters indicate how to connect to the
server: ``host`` for the server’s hostname, ``user`` for the name of the
account on the server, ``password`` for the associated password (this
last one may be absent, if the account is not password-protected).
An optional parameter ``subdb`` may contain the name of a BioSQL
subdatabase. If that parameter is present in a section, USAs referring
to that section will only look for sequences in the corresponding
subdatabase (the default is to look in the entire database, regardless
of subdatabases).
If several sections refer to the same BioSQL server (e.g. to describe
several subdatabases in the same server), the connection parameters
(``driver``, ``database``, ``host``, ``user`` and ``password``) may be
replaced by a single ``server`` parameter containing the name of another
section in the file where those parameters are defined.
For example, assuming a PostgreSQL-based BioSQL server containing two
subdatabases named *plasmids* and *genes*, one can have the following
``databases.ini`` file::
type: biosql
driver: psycopg2
host: localhost
user: myuser
password: mypassword
database: mydatabase
type: biosql
server: myserver
subdb: plasmids
type: biosql
server: myserver
subdb: genes
With such a file, the USA ``myserver:SEQ1`` will look for a sequence
named *SEQ1* in all the subdatabases on the server, whereas the USA
``plasmids:SEQ2`` will look for a sequence named *SEQ2* only in the
*plasmids* subdatabase.
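The ``server`` indirection can be sketched as follows. This is one possible way to resolve it, not the package's actual code; the section names ``myserver``, ``plasmids`` and ``genes`` are assumed from the description above, and ``resolve`` is a hypothetical helper.

```python
import configparser

# Connection parameters that a "server" reference stands in for.
CONNECTION_KEYS = ("driver", "database", "host", "user", "password")

# Assumed layout of the example file; section names are illustrative.
SAMPLE = """\
[myserver]
type: biosql
driver: psycopg2
host: localhost
user: myuser
password: mypassword
database: mydatabase

[plasmids]
type: biosql
server: myserver
subdb: plasmids

[genes]
type: biosql
server: myserver
subdb: genes
"""


def resolve(parser, name):
    """Return the settings of a section, following its "server" reference."""
    settings = dict(parser[name])
    target = settings.pop("server", None)
    if target is not None:
        # Copy connection parameters from the referenced section without
        # overriding anything set locally.
        for key in CONNECTION_KEYS:
            if key in parser[target]:
                settings.setdefault(key, parser[target][key])
    return settings


parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE)
print(resolve(parser, "plasmids"))
```

Resolving ``myserver`` itself returns its own parameters with no ``subdb`` key, which corresponds to searching the entire database.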
ExPASy database
This type of database does not need any parameter. USAs referring to
such a database will be resolved by querying the ExPASy server directly.
It is only possible to refer to a sequence by its accession number.
Field-based queries, as described in the USA specification, are not
supported.
Example configuration::
type: expasy
Example USA::
Entrez databases
This type of database expects the following parameters:
The email address to send to the NCBI server along with each query.
The Entrez database to use. It can be ``nuccore`` for the DNA/RNA
database, or ``protein`` for the protein database.
As with the ExPASy database type, only references by accession numbers
are supported.
Example configuration::
type: entrez
database: nuccore
type: entrez
database: protein
Example USAs::


@ -60,7 +60,7 @@ class DatabaseProvider(object):
* ``host`` for the hostname of the SQL server;
* ``user`` for the user account to connect to the server with;
* ``password`` for the associated password;
* ``name`` for the SQL database name;
* ``database`` for the SQL database name;
* ``subdb`` for the name of the BioSQL subdatabase, if any.
ExPASy database (``type: expasy``)
@ -281,7 +281,7 @@ class DatabaseAdapter(object):
will most likely not support it.
:return: the database records, as a list of
:class:`Bio.SeqRecord.SeqRecord` objects (or objects with a
:class:`Bio.SeqRecord.SeqRecord` objects (or objects with a
compatible interface, such as


@ -60,5 +60,13 @@ setup(
'seqvault =',
'cc3d-runner ='
'build_sphinx': {
'project': ('', 'Incenp.Bioutils'),
'version': ('', __version__),
'release': ('', __version__)