Biological Python


Computational Biology

Biological modelling and data processing demand a computational language that is fast, flexible, extensible, error-tolerant and strong in text processing and statistics. Historically, Perl and R accounted for much of the code written in the biological fields; over the last ten years, however, Python has been catching up very quickly, with the help of the numpy, scipy and scikit families of packages.


As usual, we load the required libraries upfront:

import numpy as np
import pandas as pd
import skbio as sk
import ssbio as ss
from bokeh.io import output_notebook, show
from bokeh.plotting import figure


One thing that still intrigues me after all these years about biology is its ability to form 3D structures from long chains of 1D structure, built from a limited alphabet of elements. The study of biological sequences, from a purely informatic point of view, has become the cornerstone of deeper structural and systems studies of biomolecules such as nucleic acids and proteins.

Let us create some DNA sequences to begin our exploration and perform some basic analysis on them.


from scipy.spatial.distance import hamming
from skbio import DNA


Sequence Analysis

An immediate question naturally comes up with the $r_1$, $r_2$ and $q_1$ sequences: how similar are they? A more curious reader may ask: how do we quantify this?

Naively, we may begin with a pair-wise scoring scheme, which simply counts the proportion of positions that are the same between the two sequences:

def seq_score(seq1, seq2):
    """Fraction of positions at which the two sequences agree."""
    score = 0.0
    max_i = min(len(seq1), len(seq2))
    for i in range(max_i):
        if seq1[i] == seq2[i]:
            score += 1.0
    return score / max_i

seq_score(r1, q1)

In scipy, a similar method has already been written for us, called the Hamming distance. Note that it counts the proportion of positions that differ, so it is a distance rather than a similarity score:

from scipy.spatial.distance import hamming
hamming(r1, q1)

Obviously this method is flawed: think about what an insertion into, or a deletion from, one of the sequences would do.

q2 = DNA.concat([q1[:10], DNA('G'), q1[10:]])
hamming(q1, q2[:len(q2)-1])

Just one insertion into the sequence shoots the Hamming score up from 0 to 0.60, because every position downstream of the insertion is shifted out of register. We need a proper alignment method. This is the topic of one of the most cited papers in biology, which became a cornerstone of bioinformatics: the Smith-Waterman algorithm. It finds the best-scoring local alignment of the two sequences — the alignment explainable by the fewest possible mutations (substitutions, insertions and deletions) under a given scoring scheme.
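To see what such an aligner does under the hood, here is a minimal pure-Python sketch of the Smith-Waterman dynamic-programming recurrence. The function name and the match/mismatch/gap scores here are made up for illustration; skbio's SSW implementation is far faster and also recovers the aligned sequences, not just the score.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score between sequences a and b."""
    # H[i][j] = best score of any local alignment ending at a[i-1], b[j-1];
    # the 0 option lets an alignment restart anywhere, which makes it "local".
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# An insertion now costs a single gap penalty instead of mismatching every
# downstream position, unlike the naive positional comparison above.
print(smith_waterman_score('ACGTACGT', 'ACGTGACGT'))  # 15: eight matches, one gap
```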

from skbio.alignment import local_pairwise_align_ssw
lpa = local_pairwise_align_ssw(q1, q2)

If we re-evaluate the hamming score of the now-aligned sequences,

hamming(lpa[0][0], lpa[0][1])

The score is now much more in line with our expectations.


Handy Conversions

The DNA class from skbio also provides us some convenient functions, such as

# Transcription
rna1 = q1.transcribe()

and, perhaps more useful,

# Translation
pr1 = q1.translate()

It is fair to say that the tools introduced here are also available elsewhere online, without the need to write a single line of code.

However, with the tooling introduced here, one will be able to process a huge amount of sequence data in an amazingly short span of time.



With the help of ribosomes, the result of an RNA translation is a 1D peptide chain like the one we saw in the previous example. In most cases, such 1D peptide chains embed rich structural information, which "encodes" the folding pattern of a globular protein. First, load the required Python libraries.

from ssbio.core.protein import Protein
import tempfile

Now we load the mystery peptide sequence onto the model,

root_dir = tempfile.gettempdir()
prot_id = 'UNK5'
prot3 = Protein(ident=prot_id, root_dir=root_dir, pdb_file_type='mmtf')
prot3.load_manual_sequence(seq=prot_seq, ident='WT', 
    write_fasta_file=True, set_as_representative=True)

Find the protein's PDB ID by sequence alignment (BLAST):

prot3.blast_representative_sequence_to_pdb(seq_ident_cutoff=0.9, evalue=0.00001)

Download the PDB in mmtf format, and output some basic information into a dataframe.


Display the file in the embedded protein visualiser nglviewer.

import nglview as nv
prot_file = prot3.structure_dir + '/' + prot3.df_pdb_metadata['structure_file'][0]
view = nv.show_file(prot_file)
view


Can you try displaying the structure of the following gene?

    1 gcgagtgtgg atttctcagt taaccgagaa cggtcacgct gctgctgtcg aggaaggaaa
   61 ataaaataat tttgctggcc aactggtcga tttaaacggg gaaatgcaga tgcaaacgta
  121 caaactggtg gtcgtcggcg gcggcggcgt gggcaagtca gcgataacga tacagttcat
  181 ccagagctac ttcgtcacgg actacgatcc caccattgaa gactcgtaca cgaagcagtg
  241 caacatcgac gatgtgccag ccaaattgga cattttggac acggctggcc aggaggagtt
  301 cagtgccatg cgggagcagt acatgcgctc cggcgaggga ttcctgctcg tcttcgcgct
  361 caacgatcat tccagcttcg atgagatccc caagttccag cgccagatac tgcgcgtcaa
  421 ggatcgcgac gagttcccca tgctgatggt gggtaacaag tgcgacctga agcaccagca

Image Analysis

With the explosive use of imaging facilities, the amount of pixelated data requiring processing has also grown exponentially. No longer can we process the imaging data manually, i.e. checking it with the naked eye, tagging it in Photoshop, etc. We need something much more clever and consistent to quantify and analyse the batches of images we acquire from the experiments.

In the following section we will go through a quick tasting course on image analysis.


What is an image?

From a geometric point of view, an image is the colour-mapped expression of a function over a defined, continuous 2D region,

$$ \{R, G, B\} = \textrm{Image}(x, y), $$ where $R$, $G$ and $B$ represent the depths of the red, green and blue colours. A simple image can therefore easily be composed by playing with the $\textrm{Image}$ function, as below

N = 100
x = np.linspace(0, 10, N)
y = np.linspace(0, 10, N)
xx, yy = np.meshgrid(x, y)
img = np.sin(xx)*np.cos(yy)

p = figure(x_range=(0, 10), y_range=(0, 10), plot_height=300,
           tooltips=[("x", "$x"), ("y", "$y"), ("value", "@image")])

# must give a vector of image data for image parameter
p.image(image=[img], x=0, y=0, dw=10, dh=10, palette="Spectral10")
show(p)


Curious readers may want to check the values of xx and yy, and try altering the function img(xx, yy) to see what the impact is. What does the variable N do?


Real Images

The imaging data we obtain from optical sources is of course not as simple as the mathematical function shown in the previous section. Each pixel at $(x, y)$ has its own RGB value, hence the name bitmap (vs. vector graphics). The most widely used bitmap format is 24-bit RGB (8 bits, i.e. 256 depth levels, per colour), which can hold a total of about 16.8 million colour tones.
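To make the bitmap idea concrete, here is a tiny sketch (pixel values invented) showing that a 24-bit RGB image is just a height × width × 3 array of 8-bit integers:

```python
import numpy as np

# A 2x2-pixel, 24-bit RGB bitmap: one uint8 (0..255 depth levels) per channel.
img = np.array([
    [[255, 0, 0],  [0, 255, 0]],       # red pixel,  green pixel
    [[0, 0, 255],  [255, 255, 255]],   # blue pixel, white pixel
], dtype=np.uint8)

print(img.shape)  # (2, 2, 3): height x width x (R, G, B)
print(256 ** 3)   # 16777216 -- the ~16.8 million representable colour tones
```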

As long as a picture is not random noise, it is generally possible to derive features and patterns from the chunks of bitmap data and use them to perform "intelligent" operations, such as object detection, boundary finding, tagging, area calculation, number/text recognition and, of course, face recognition. Most of these involve regressing noisy pixelated data onto a recipe of simpler mathematical functions so that a feature, pattern or model can be extracted and used for prediction.

Let's look at some simple examples.



Image segmentation is the task of labelling the objects of interest in an image. Human eyes do so by detecting the colour contrast of object boundaries against their background; the algorithms below rely on the same principle.

from skimage import data, color, feature
import matplotlib.pyplot as plt
coins = data.coins()

Histogram of colour intensity

From the histogram below (binned over 256 shades of grey), we can see that this picture has a rather rich spectrum of mid-range intensities. This makes any attempt to segment the image purely by colour depth quite difficult.

h = plt.hist(coins.ravel(), bins=256)

Segmentation by edge detection

Below is an edge-based segmentation approach. Try running it and assess how well it works. After that, annotate each line of the code (from the third line onwards) with a comment explaining what it does. One way of researching what a piece of code does is simply to comment it out!

from skimage.feature import canny
from scipy import ndimage as ndi
edges = canny(coins/255.)
fill_coins = ndi.binary_fill_holes(edges)
label_objects, nb_labels = ndi.label(fill_coins)
sizes = np.bincount(label_objects.ravel())
mask_sizes = sizes > 20
mask_sizes[0] = 0
coins_cleaned = mask_sizes[label_objects]

Segmentation by region detection

Here is an alternative method for image segmentation.

Step 1 - Find high-contrast borders

from skimage.filters import sobel
elevation_map = sobel(coins)
fig, ax = plt.subplots(figsize=(8, 6))
ax.imshow(elevation_map, interpolation='nearest')
ax.set_title('elevation map')

Step 2 - Find markers for coin areas

markers = np.zeros_like(coins)
markers[coins < 30] = 1
markers[coins > 150] = 2
fig, ax = plt.subplots(figsize=(8, 6))
ax.imshow(markers, interpolation='nearest')

Step 3 - Fill in the markers

from skimage import morphology
from skimage.segmentation import watershed
segmentation = watershed(elevation_map, markers)
seg_binary = segmentation - 1
fig, ax = plt.subplots(figsize=(8, 6))
ax.imshow(segmentation, interpolation='nearest')

Step 4 - Show the results

Now we can show the results in a few different ways.

from skimage.color import label2rgb

seg_filled = ndi.binary_fill_holes(seg_binary)
labeled_coins, _ = ndi.label(seg_filled)
image_label_overlay = label2rgb(labeled_coins, image=coins)

fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
axes[0].imshow(coins, interpolation='nearest')
axes[0].contour(seg_filled, [0.5], linewidths=1.2, colors='y')
axes[1].imshow(image_label_overlay, interpolation='nearest')

for a in axes:
    a.axis('off')

Step 5 - Count the coins

Having done a lot of work, we may now come to a practical question. How many coins are there? How to count the coins?

Our eyes are good at this. Is our algo good, too? Let's try.

from skimage import measure
cleared_coins = morphology.remove_small_objects(labeled_coins, 64)
n_coins = len(np.unique(cleared_coins)) - 1  # subtract 1 for the background label
print(n_coins)

Is it correct?


The End.