
Add more documentation.

Start properly documenting the project in a Sphinx-based manual.
master
Damien Goutte-Gattat 11 months ago
parent
commit
671486d80c
  1. docs/conf.py (27 changes)
  2. docs/index.rst (34 changes)
  3. docs/install.rst (71 changes)
  4. docs/seqtool.rst (81 changes)
  5. docs/seqvault.rst (97 changes)
  6. docs/source/incenp.bio.modelling.rst (22 changes)
  7. docs/source/incenp.bio.rst (19 changes)
  8. docs/source/incenp.bio.seq.rst (78 changes)
  9. docs/source/modules.rst (7 changes)
  10. docs/usa.rst (178 changes)
  11. incenp/bio/seq/databases.py (4 changes)
  12. setup.py (10 changes)

27
docs/conf.py

@ -0,0 +1,27 @@
# -*- coding: utf-8 -*-
# -- Project information -----------------------------------------------------
project = 'Incenp.Bioutils'
copyright = '2021 Damien Goutte-Gattat'
author = 'Damien Goutte-Gattat <dgouttegattat@incenp.org>'
# -- General configuration ---------------------------------------------------
source_suffix = '.rst'
master_doc = 'index'
language = 'en'
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
pygments_style = 'sphinx'
extensions = [
'sphinx.ext.autodoc'
]
# -- Options for HTML output -------------------------------------------------
html_theme = 'sphinx_rtd_theme'
html_static_path = ['_static']

34
docs/index.rst

@ -0,0 +1,34 @@
.. Incenp.Bioutils documentation master file, created by
sphinx-quickstart on Tue Jan 5 22:29:38 2021.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Incenp.Bioutils Documentation
=============================
Incenp.Bioutils is a set of command line utilities and helper Python
modules for computational biology. It is built on top of `Biopython`_.
.. _Biopython: https://biopython.org/
Incenp.Bioutils is free software, published under the GNU General Public
License. It is written in Python and should work with Python 3.6+.
.. toctree::
:maxdepth: 2
:caption: Contents:
install
seqtool
seqvault
usa
source/modules
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

71
docs/install.rst

@ -0,0 +1,71 @@
**************************
Installing Incenp.Bioutils
**************************
Installing from PyPI
====================
Packages for Incenp.Bioutils are published on the
`Python Package Index`_ under the name ``incenp.bioutils``. To install
the latest version from PyPI:
.. _Python Package Index: https://pypi.org/project/incenp.bioutils/
.. code-block:: console
$ pip install -U incenp.bioutils
Installing from source
======================
You may download a release tarball from the `homepage`_ or from the
`release page`_, and then proceed to a manual installation:
.. _homepage: https://incenp.org/dvlpt/bioutils.html
.. _release page: https://git.incenp.org/damien/bioutils/releases
.. code-block:: console
$ tar zxf incenp.bioutils-|version|.tar.gz
$ cd incenp.bioutils-|version|
$ python setup.py build
$ python setup.py install
You may also clone the repository:
.. code-block:: console
$ git clone https://git.incenp.org/damien/bioutils.git
and then proceed as above.
Dependencies
============
Incenp.Bioutils requires the following Python dependencies to work:
* `Biopython <https://biopython.org>`_
* `Click <https://palletsprojects.com/p/click/>`_
* `Click-Shell <https://github.com/clarkperkins/click-shell>`_
If you install Incenp.Bioutils from the Python Package Index with ``pip``
as described above, those dependencies should be automatically installed
if they are not already available on your system.
Testing the installation
========================
You can check whether Incenp.Bioutils has been installed correctly by
trying to invoke one of the command-line utilities it provides:
.. code-block:: console
$ seqtool --version
seqtool |version|
Copyright © 2020 Damien Goutte-Gattat
This program is released under the GNU General Public License.
See the COPYING file or <http://www.gnu.org/licenses/gpl.html>
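The same check can also be made from Python. The following is only a minimal sketch: it assumes Python 3.8 or later (for ``importlib.metadata``) and the distribution name ``incenp.bioutils`` used on PyPI; ``installed_version`` is a hypothetical helper, not something provided by the package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="incenp.bioutils"):
    """Return the installed version of a distribution, or None if absent.

    Hypothetical helper for illustration; not part of Incenp.Bioutils.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version()
print(found if found else "incenp.bioutils does not appear to be installed")
```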

81
docs/seqtool.rst

@ -0,0 +1,81 @@
*******************
The seqtool command
*******************
The ``seqtool`` command provides subcommands to perform various
operations on sequence files.
All subcommands operate on sequences specified as :doc:`Uniform Sequence
Addresses <usa>`.
The cat subcommand
==================
The ``cat`` subcommand is analogous to the Unix command of the same name.
It reads sequence files and writes out another sequence file. It can be
used for several purposes:
* converting a sequence file from one format to another;
* removing spurious annotations left in a sequence file by some
molecular biology programs;
* catenating several sequences into a single sequence (keeping all the
associated sequence features).
Examples
--------
Converting a Genbank file into a FASTA file::
$ seqtool cat genbank::file.gb -o fasta::file.fasta
Creating a new sequence by catenating several input sequences together,
and setting some annotations in the resulting sequence::
$ seqtool cat -o genbank::result.gb \
--name "NEWSEQ" \
--description "This is my catenated sequence" \
--division SYN \
--clean \
fasta::left.fasta abi::middle.ab1 genbank::left.gb
The siresist subcommand
=======================
The ``siresist`` subcommand takes a DNA sequence as input and generates
a new DNA sequence with the same translation but using different codons.
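To illustrate the idea of keeping the translation while changing the codons, here is a minimal sketch using only the standard library. It is not the algorithm used by ``siresist``: the rule for picking a replacement codon (rotating to the next synonym in the standard genetic code) is an assumption made for the example, and ``recode`` and ``translate`` are hypothetical names.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group the codons by the amino acid (or stop signal) they encode.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def translate(dna):
    """Translate a DNA sequence whose length is a multiple of 3."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def recode(dna):
    """Replace each codon with the next synonymous codon, if there is one.

    Illustrative rule only (an assumption, not the siresist strategy).
    """
    dna = dna.upper()
    if len(dna) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    out = []
    for i in range(0, len(dna), 3):
        codon = dna[i:i + 3]
        choices = SYNONYMS[CODON_TABLE[codon]]
        # ATG (Met) and TGG (Trp) have no synonym and stay unchanged.
        out.append(choices[(choices.index(codon) + 1) % len(choices)])
    return "".join(out)


original = "ATGGCTAAGTGGCTTTAA"
recoded = recode(original)
print(original, "->", recoded)
print(translate(original) == translate(recoded))
```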
The gateway subcommand
======================
The ``gateway`` subcommand performs *in silico* Gateway® reactions. It
automatically detects the appropriate *attB/P/L/R* sites within the two
input sequences and generates an output sequence with all the features
appropriately copied over.
The plasmm subcommand
=====================
Given the annotated sequence of a plasmid, the ``plasmm`` subcommand
generates a PDF file describing that plasmid, with a map and a list of the
main sequence features.
The blast and dotter subcommands
================================
These are wrappers for the different BLAST commands (``blastn``,
``blastp``, ``tblastn``, and so on) from the `NCBI BLAST package`_ and
the ``dotter`` command from the `SeqTools package`_.
.. _NCBI BLAST package: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
.. _SeqTools package: https://www.sanger.ac.uk/tool/seqtools/
They don’t give access to all the options of the original programs. Their
main advantage is that they accept sequences in any format supported by
Biopython’s ``SeqIO`` module, whereas the original programs only read
files in the FASTA format.

97
docs/seqvault.rst

@ -0,0 +1,97 @@
********************
The seqvault command
********************
The ``seqvault`` command provides a command-line interface to `BioSQL`_
databases.
.. _BioSQL: https://biosql.org/
It is intended to be used with a slightly modified version of the BioSQL
database schema (provided with ``Incenp.Bioutils`` source code in the
``biosql`` directory), where every ``biodatabase`` is associated with a
three-letter prefix. That prefix is then used to automatically assign
accession numbers (of the form ``PRE_xxxxxx``, where ``PRE`` is the
prefix) when importing sequences into the database. However, ``seqvault``
can also be used with pristine BioSQL databases.
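As an illustration of the ``PRE_xxxxxx`` form, the sketch below builds an accession number from a prefix and a serial number. The exact rules are assumptions made for the example (three uppercase letters, a zero-padded six-digit serial), and ``make_accession`` is a hypothetical helper; the way ``seqvault`` actually allocates numbers is not described here.

```python
import re


def make_accession(prefix, serial):
    """Format an accession number as PRE_xxxxxx.

    Assumptions for this sketch: the prefix is three uppercase letters
    and the serial number is zero-padded to six digits.
    """
    if not re.fullmatch(r"[A-Z]{3}", prefix):
        raise ValueError("prefix must be three uppercase letters")
    if not 0 <= serial <= 999999:
        raise ValueError("serial number must fit in six digits")
    return f"{prefix}_{serial:06d}"


print(make_accession("PLM", 123456))
```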
Setting up the BioSQL database
==============================
If you don’t already have a BioSQL database (or access to one), follow
these instructions to set one up.
With PostgreSQL
---------------
Create a new PostgreSQL user account and a new database::
# createuser <username>
# createdb -O <username> <dbname>
Initialize the newly created database by running the provided
``biosql/biosqldb-pg.sql`` script::
# psql -h localhost -U <username> <dbname> < biosql/biosqldb-pg.sql
With SQLite
-----------
Create and initialize the database with the following command::
$ sqlite3 mydb.sqlite < biosql/biosqldb-sqlite.sql
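The same initialization can be done from Python with the standard ``sqlite3`` module. This is a minimal sketch, assuming the schema script can be run as a whole with ``executescript``; ``init_database`` is a hypothetical helper, not part of the package.

```python
import sqlite3
from pathlib import Path


def init_database(db_path, schema_path):
    """Create (or open) a SQLite database and run a schema script on it.

    Hypothetical helper for illustration. Returns the open connection;
    the caller is responsible for closing it.
    """
    script = Path(schema_path).read_text(encoding="utf-8")
    conn = sqlite3.connect(str(db_path))
    conn.executescript(script)
    conn.commit()
    return conn


# Typical call, mirroring the command above (paths taken from this page):
#   conn = init_database("mydb.sqlite", "biosql/biosqldb-sqlite.sql")
#   conn.close()
```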
Configuring seqvault
====================
Create an INI-style configuration file named ``databases.ini`` in the
``$XDG_CONFIG_HOME/bioutils`` directory, describing the BioSQL server(s)
to use with ``seqvault``. For example, to access the two databases
created above, use the following file::
[db1]
type: biosql
driver: psycopg2
host: localhost
user: <username>
database: <dbname>
[db2]
type: biosql
driver: sqlite3
database: <path/to/mydb.sqlite>
If the ``username`` user account on the PostgreSQL server is password-protected,
add a ``password`` option in the corresponding section.
The ``seqvault`` program will by default connect to the first server
described in the configuration file. Use the ``-s`` option to choose
another section from the configuration file.
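The "first section by default" rule can be sketched with the standard ``configparser`` module, which preserves the order of sections and accepts the ``key: value`` syntax used above. This is an illustration only, not the ``seqvault`` implementation; ``select_section`` is a hypothetical helper and the values in the sample file are placeholders.

```python
import configparser

SAMPLE = """\
[db1]
type: biosql
driver: psycopg2
host: localhost
user: myuser
database: mydatabase

[db2]
type: biosql
driver: sqlite3
database: /path/to/mydb.sqlite
"""


def select_section(config_text, name=None):
    """Return (section name, options) for the requested section.

    When no name is given, fall back to the first section in the file.
    """
    # Interpolation is disabled so that a '%' in a password is harmless.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    sections = parser.sections()
    if not sections:
        raise ValueError("no section in configuration")
    chosen = sections[0] if name is None else name
    if chosen not in sections:
        raise KeyError(f"no section named {chosen!r}")
    return chosen, dict(parser[chosen])


print(select_section(SAMPLE))          # default: first section
print(select_section(SAMPLE, "db2"))   # explicit choice, as with -s
```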
Using seqvault
==============
The following examples show some typical uses of ``seqvault``.
Creating a new BioSQL subdatabase named *plasmids* with the prefix
``PLM``::
$ seqvault newdb -p PLM plasmids
Importing a sequence from a file into the subdatabase::
$ seqvault add plasmids genbank::file.gb
Listing all sequences in a subdatabase::
$ seqvault list plasmids
Extracting a sequence from a subdatabase::
$ seqvault get PLM_123456

22
docs/source/incenp.bio.modelling.rst

@ -0,0 +1,22 @@
incenp.bio.modelling package
============================
Submodules
----------
incenp.bio.modelling.cc3d module
--------------------------------
.. automodule:: incenp.bio.modelling.cc3d
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: incenp.bio.modelling
:members:
:undoc-members:
:show-inheritance:

19
docs/source/incenp.bio.rst

@ -0,0 +1,19 @@
incenp.bio package
==================
Subpackages
-----------
.. toctree::
:maxdepth: 4
incenp.bio.modelling
incenp.bio.seq
Module contents
---------------
.. automodule:: incenp.bio
:members:
:undoc-members:
:show-inheritance:

78
docs/source/incenp.bio.seq.rst

@ -0,0 +1,78 @@
incenp.bio.seq package
======================
Submodules
----------
incenp.bio.seq.databases module
-------------------------------
.. automodule:: incenp.bio.seq.databases
:members:
:undoc-members:
:show-inheritance:
incenp.bio.seq.plasmidmap module
--------------------------------
.. automodule:: incenp.bio.seq.plasmidmap
:members:
:undoc-members:
:show-inheritance:
incenp.bio.seq.seqtool module
-----------------------------
.. automodule:: incenp.bio.seq.seqtool
:members:
:undoc-members:
:show-inheritance:
incenp.bio.seq.seqvault module
------------------------------
.. automodule:: incenp.bio.seq.seqvault
:members:
:undoc-members:
:show-inheritance:
incenp.bio.seq.usa module
-------------------------
.. automodule:: incenp.bio.seq.usa
:members: USA, parse_usa, read_usa, write_usa
:undoc-members:
:show-inheritance:
incenp.bio.seq.utils module
---------------------------
.. automodule:: incenp.bio.seq.utils
:members:
:undoc-members:
:show-inheritance:
incenp.bio.seq.vault module
---------------------------
.. automodule:: incenp.bio.seq.vault
:members:
:undoc-members:
:show-inheritance:
incenp.bio.seq.wrappers module
------------------------------
.. automodule:: incenp.bio.seq.wrappers
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: incenp.bio.seq
:members:
:undoc-members:
:show-inheritance:

7
docs/source/modules.rst

@ -0,0 +1,7 @@
API Documentation
=================
.. toctree::
:maxdepth: 4
incenp.bio

178
docs/usa.rst

@ -0,0 +1,178 @@
**************************
Uniform Sequence Addresses
**************************
The ``Incenp.Bioutils`` package supports the `Uniform Sequence Address`_
scheme designed and used by the `EMBOSS package`_. All command-line
tools can read and write sequences from and to a location specified by
such addresses.
.. _Uniform Sequence Address: http://emboss.sourceforge.net/docs/themes/UniformSequenceAddress.html
.. _EMBOSS package: http://emboss.sourceforge.net/what/
Principle and examples
======================
Briefly, a *Uniform Sequence Address* or *USA* is a unified way to
specify the location and optionally the format of a biological sequence.
Please see the EMBOSS document referred to above for a complete
description of USAs, including a formal specification of their syntax.
Here are some examples of USAs:
genbank::file.gb
Get all sequences in the file ``file.gb``, expected to be in the
*Genbank* format.
fasta::file.fasta[20:100]
Get the segment 20..100 from all sequences in the *FASTA* file
``file.fasta``.
fasta::file.fasta[20:100:r]
Same as the previous example, but reverse-complement the segments.
genbank::file.gb:SEQ1
Get the sequence named ``SEQ1`` from the *Genbank* file ``file.gb``.
mydb:SEQ1
Get the sequence named ``SEQ1`` from the database ``mydb``.
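To make the structure of these addresses concrete, the sketch below splits the example USAs above into their components with a regular expression. It covers only the forms shown in these examples, not the full EMBOSS syntax, and it is not the ``parse_usa`` function of ``incenp.bio.seq.usa``; ``parse_simple_usa`` and ``ParsedUSA`` are hypothetical names. Note that the syntax alone does not tell whether the source is a file or a database.

```python
import re
from typing import NamedTuple, Optional

USA_RE = re.compile(
    r"""
    ^(?:(?P<format>[A-Za-z0-9]+)::)?   # optional format prefix
    (?P<source>[^:\[\]]+)              # file name or database identifier
    (?::(?P<entry>[^:\[\]]+))?         # optional entry name
    (?:\[(?P<start>-?\d+):(?P<end>-?\d+)(?P<reverse>:r)?\])?$
    """,
    re.VERBOSE,
)


class ParsedUSA(NamedTuple):
    format: Optional[str]
    source: str
    entry: Optional[str]
    start: Optional[int]
    end: Optional[int]
    reverse: bool


def parse_simple_usa(text):
    """Parse the simplified USA forms shown in the examples above."""
    match = USA_RE.match(text)
    if match is None:
        raise ValueError(f"unsupported USA: {text!r}")
    start, end = match.group("start"), match.group("end")
    return ParsedUSA(
        format=match.group("format"),
        source=match.group("source"),
        entry=match.group("entry"),
        start=int(start) if start is not None else None,
        end=int(end) if end is not None else None,
        reverse=match.group("reverse") is not None,
    )


for usa in ("genbank::file.gb", "fasta::file.fasta[20:100:r]", "mydb:SEQ1"):
    print(parse_simple_usa(usa))
```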
Configuration of databases
==========================
To fetch sequences from biological databases as in the last example
above, the databases to use must first be described in a configuration
file located in ``$XDG_CONFIG_HOME/bioutils/databases.ini``.
Each section in this INI-style file describes a database. The database
identifier in a USA must match the name of one of the sections in the
file. For example, the last USA above assumes the ``databases.ini`` file
contains a section named *mydb*.
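The location of the configuration file can be computed as in the sketch below. The fallback to ``~/.config`` when ``XDG_CONFIG_HOME`` is unset is the default from the XDG Base Directory specification; whether the package applies the same fallback is an assumption here, and ``databases_ini_path`` is a hypothetical helper.

```python
import os
from pathlib import Path


def databases_ini_path(environ=None):
    """Return the expected path of databases.ini.

    Uses $XDG_CONFIG_HOME when it is set and non-empty, otherwise
    ~/.config (assumed fallback, per the XDG specification).
    """
    env = os.environ if environ is None else environ
    base = env.get("XDG_CONFIG_HOME") or str(Path.home() / ".config")
    return Path(base) / "bioutils" / "databases.ini"


print(databases_ini_path())
```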
Within a section, the ``type`` parameter indicates the type of database.
Supported database types are:
biosql
A SQL database using the BioSQL schema, as supported by Biopython.
expasy
The ExPASy server.
entrez
One of the NCBI Entrez databases.
BioSQL databases
----------------
A section describing a BioSQL database must contain at least the
following parameters:
driver
Indicates the Python SQL driver (dependent on the underlying SQL
database server; for example, ``psycopg2`` for a PostgreSQL server,
or ``sqlite3`` for a SQLite database).
database
The name of the database. For a SQLite database, this is the path
to the database file.
For non-SQLite servers, other parameters indicate how to connect to the
server: ``host`` for the server’s hostname, ``user`` for the name of the
account on the server, ``password`` for the associated password (this
last one may be absent, if the account is not password-protected).
An optional parameter ``subdb`` may contain the name of a BioSQL
subdatabase. If that parameter is present in a section, USAs referring
to that section will only look for sequences in the corresponding
subdatabase (the default is to look in the entire database, regardless
of subdatabases).
If several sections refer to the same BioSQL server (e.g. to describe
several subdatabases in the same server), the connection parameters
(``driver``, ``database``, ``host``, ``user`` and ``password``) may be
replaced by a single ``server`` parameter containing the name of another
section in the file where those parameters are defined.
For example, assuming a PostgreSQL-based BioSQL server containing two
subdatabases named *plasmids* and *genes*, one can have the following
``databases.ini`` file::
[myserver]
type: biosql
driver: psycopg2
host: localhost
user: myuser
password: mypassword
database: mydatabase
[plasmids]
type: biosql
server: myserver
subdb: plasmids
[genes]
type: biosql
server: myserver
subdb: genes
With such a file, the USA ``myserver:SEQ1`` will look for a sequence
named *SEQ1* in all the subdatabases on the server, whereas the USA
``plasmids:SEQ2`` will look for a sequence named *SEQ2* only in the
*plasmids* subdatabase.
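The ``server`` indirection can be sketched with the standard ``configparser`` module: connection parameters are taken from the referenced section, while ``subdb`` stays with the section named in the USA. This is an illustration under those assumptions, not the code in ``incenp.bio.seq.databases``; ``resolve_biosql_section`` is a hypothetical helper.

```python
import configparser

CONNECTION_KEYS = ("driver", "database", "host", "user", "password")

SAMPLE = """\
[myserver]
type: biosql
driver: psycopg2
host: localhost
user: myuser
password: mypassword
database: mydatabase

[plasmids]
type: biosql
server: myserver
subdb: plasmids
"""


def resolve_biosql_section(parser, name):
    """Collect connection parameters for a section, following 'server'."""
    section = parser[name]
    target = section.get("server")
    # Connection parameters come from the referenced section, if any.
    source = parser[target] if target else section
    params = {key: source[key] for key in CONNECTION_KEYS if key in source}
    # The subdatabase restriction belongs to the section itself.
    if "subdb" in section:
        params["subdb"] = section["subdb"]
    return params


parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE)
print(resolve_biosql_section(parser, "plasmids"))
print(resolve_biosql_section(parser, "myserver"))
```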
ExPASy database
---------------
This type of database does not need any parameter. USAs referring to
such a database will be resolved by querying the ExPASy server directly.
It is only possible to refer to a sequence by its accession number.
Field-based queries, as described in the USA specification, are not
supported.
Example configuration::
[uniprot]
type: expasy
Example USA::
uniprot:P49450
Entrez databases
----------------
This type of database expects the following parameters:
email
The email address to send to the NCBI server along with each query.
database
The Entrez database to use. It can be ``nuccore`` for the DNA/RNA
database, or ``protein`` for the protein database.
As for the ExPASy database type, only references by accession numbers
are supported.
Example configuration::
[genbank]
type: entrez
email: myemail@example.org
database: nuccore
[gbprot]
type: entrez
email: myemail@example.org
database: protein
Example USAs::
genbank:NM_001809
gbprot:NP_001800

4
incenp/bio/seq/databases.py

@ -60,7 +60,7 @@ class DatabaseProvider(object):
* ``host`` for the hostname of the SQL server;
* ``user`` for the user account to connect to the server with;
* ``password`` for the associated password;
- * ``name`` for the SQL database name;
+ * ``database`` for the SQL database name;
* ``subdb`` for the name of the BioSQL subdatabase, if any.
ExPASy database (``type: expasy``)
@ -281,7 +281,7 @@ class DatabaseAdapter(object):
will most likely not support it.
:return: the database records, as a list of
- :class:`Bio.SeqRecord.SeqRecord` objects (or objects with a
+ :class:`Bio.SeqRecord.SeqRecord` objects (or objects with a
compatible interface, such as
:class:`BioSQL.BioSeq.DBSeqRecord`)
"""

10
setup.py

@ -60,5 +60,13 @@ setup(
'seqvault = incenp.bio.seq.seqvault:seqvault',
'cc3d-runner = incenp.bio.modelling.cc3d:main'
]
- }
+ },
+ command_options={
+ 'build_sphinx': {
+ 'project': ('setup.py', 'Incenp.Bioutils'),
+ 'version': ('setup.py', __version__),
+ 'release': ('setup.py', __version__)
+ }
+ }
)
