Conceptualization

This demo final project for COE 332 is called PSSP API. It is a REST API to a scientific code base: a protein secondary structure prediction (PSSP) tool called Predict_Property.

Background

Protein secondary structure, or the local folded structures that form within a protein, can be predicted somewhat successfully from the primary amino acid sequence. Researchers have developed tools that take an amino acid sequence as input (e.g. ‘AAAAAAA’) and return the likelihood of different folded structures occurring at each position.

Many of these tools are Linux command line tools. It would be useful and interesting to have an intuitive REST API for calling one of these tools, so that users can perform these prediction calculations without needing command line experience and without installing the tool themselves. Further, this is a first step toward encapsulating the function of the scientific code base in a web interface.

Scope

The scope of this project is narrow. It is designed to expect only one kind of input: a protein primary sequence given as a string of letters. It performs the same standard analysis using the Predict_Property command line tools each time. The results always follow the same 8-line format shown below (the # annotations on each line are not part of the result):

> Header Info     #-> header / metadata including job id
ASDFASDGFAGASG    #-> user input sequence with invalid amino acid shown as 'X'.
HHHHEEECCCCCHH    #-> 3-class secondary structure (SS3) prediction.
HHGGEEELLSSTHH    #-> 8-class secondary structure (SS8) prediction.
EEMMEEBBEEEBBM    #-> 3-state solvent accessibility (ACC) prediction.
*****......***    #-> disorder (DISO) prediction, with disorder residue shown as '*'.
_____HHHHH____    #-> 2-class transmembrane topology (TM2) prediction.
UU___HHHHH____    #-> 8-class transmembrane topology (TM8) prediction.
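
Because the format is fixed, a client can split a result into named fields by line position. The following is a minimal sketch in Python, not part of the project's code; the field names and the `parse_result` helper are hypothetical labels chosen for illustration, and the length check assumes every prediction line covers the sequence position by position, as in the example above.

```python
from typing import Dict

# Field names are labels chosen for this sketch; only the line order comes
# from the 8-line format described above.
FIELDS = ("header", "sequence", "ss3", "ss8", "acc", "diso", "tm2", "tm8")


def parse_result(text: str) -> Dict[str, str]:
    """Split an 8-line PSSP result into named fields."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} lines, got {len(lines)}")
    record = dict(zip(FIELDS, lines))
    # Drop the leading '>' marker from the header line, if present.
    record["header"] = record["header"].lstrip(">").strip()
    # Assumption: each prediction track has one symbol per residue.
    for name in FIELDS[2:]:
        if len(record[name]) != len(record["sequence"]):
            raise ValueError(f"{name} length does not match sequence length")
    return record


example = """\
> Header Info
ASDFASDGFAGASG
HHHHEEECCCCCHH
HHGGEEELLSSTHH
EEMMEEBBEEEBBM
*****......***
_____HHHHH____
UU___HHHHH____
"""

result = parse_result(example)
print(result["header"])
print(result["ss3"])
```

Raising an error on an unexpected line count keeps a truncated or malformed result from being silently misread.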

Here are a few tables for interpreting the results:

SS3                       SS8
Key   Value               Key   Value
H     a-helix             H     a-helix
E     b-sheet             G     3-helix
C     coil                I     5-helix
                          E     b-strand
                          B     b-bridge
                          T     turn
                          S     bend
                          L     loop

ACC                       DISO
Key   Value               Key   Value
B     buried              *     disordered
M     medium              _     not disordered
E     exposed

TM2                       TM8
Key   Value               Key   Value
H     transmembrane       H     transmem helix
_     not transmembrane   E     transmem strand
                          C     transmem coil
                          I     membrane-inside
                          L     membrane-loop
                          F     interfacial helix
                          X     unknown localizations
                          _     not transmembrane
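
The key tables above can be written as lookup dictionaries and used to turn a prediction line into labelled segments. This is a minimal sketch in Python under stated assumptions: the `KEYS` dictionary and the `segments` helper are hypothetical names for illustration, and any symbol that does not appear in the tables is reported as unlisted rather than guessed.

```python
from itertools import groupby
from typing import Dict, List, Tuple

# Key tables transcribed from the tables above.
KEYS: Dict[str, Dict[str, str]] = {
    "SS3": {"H": "a-helix", "E": "b-sheet", "C": "coil"},
    "SS8": {
        "H": "a-helix", "G": "3-helix", "I": "5-helix", "E": "b-strand",
        "B": "b-bridge", "T": "turn", "S": "bend", "L": "loop",
    },
    "ACC": {"B": "buried", "M": "medium", "E": "exposed"},
    "DISO": {"*": "disordered", "_": "not disordered"},
    "TM2": {"H": "transmembrane", "_": "not transmembrane"},
    "TM8": {
        "H": "transmem helix", "E": "transmem strand", "C": "transmem coil",
        "I": "membrane-inside", "L": "membrane-loop", "F": "interfacial helix",
        "X": "unknown localizations", "_": "not transmembrane",
    },
}


def segments(track: str, kind: str) -> List[Tuple[str, int, int]]:
    """Collapse a per-residue track into (label, start, end) runs.

    Positions are 1-based and inclusive.
    """
    key = KEYS[kind]
    runs = []
    position = 1
    for symbol, group in groupby(track):
        length = len(list(group))
        # Symbols missing from the key table are flagged, not interpreted.
        label = key.get(symbol, f"unlisted symbol {symbol!r}")
        runs.append((label, position, position + length - 1))
        position += length
    return runs


for label, start, end in segments("HHHHEEECCCCCHH", "SS3"):
    print(f"{start:>2}-{end:<2} {label}")
```

Grouping consecutive identical symbols gives a compact summary of where each predicted structure begins and ends along the sequence.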

User Interface

With this API, users should have access to a number of routes for GETting information about the service and about past jobs that have been run. For example, using curl:

curl localhost:5041/             # general info
curl localhost:5041/run          # get instructions to submit a job
curl localhost:5041/jobs         # get past jobs
curl localhost:5041/jobs/JOBID   # get results for JOBID
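
The same GET routes could also be called from a script. Below is a minimal sketch using only the Python standard library. It assumes the service listens on localhost:5041 as in the curl examples; `build_get` and `fetch` are hypothetical helper names, and because the response format is not specified here, the body is returned as raw text.

```python
from urllib.request import Request, urlopen

# Host and port taken from the curl examples above.
BASE_URL = "http://localhost:5041"


def build_get(route: str) -> Request:
    """Build (but do not send) a GET request for one of the routes."""
    return Request(f"{BASE_URL}/{route.lstrip('/')}", method="GET")


def fetch(route: str) -> str:
    """Send the GET request and return the response body as text.

    Requires the service to be running; not called at import time.
    """
    with urlopen(build_get(route), timeout=10) as response:
        return response.read().decode("utf-8")


# Show the URLs that would be requested, without touching the network.
for route in ("/", "/run", "/jobs", "/jobs/JOBID"):
    print(build_get(route).full_url)
```

With the service running, `fetch("/jobs")` would correspond to the third curl command above.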

Users should also be able to POST protein primary sequences (e.g. ‘AAAAAAA’) to a defined route and receive a Job ID in return:

curl -X POST -d "seq=AAAAA" localhost:5041/run
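
A scripted equivalent of this POST might look like the sketch below, again using only the Python standard library. The form field name `seq` and the `/run` route come from the curl command; `build_submit` and `submit` are hypothetical helper names, and since the exact shape of the response is not specified here, the sketch returns the body as text without trying to extract the Job ID from it.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Host and port taken from the curl example above.
BASE_URL = "http://localhost:5041"


def build_submit(sequence: str) -> Request:
    """Build (but do not send) a form-encoded POST of a sequence to /run."""
    body = urlencode({"seq": sequence}).encode("ascii")
    return Request(f"{BASE_URL}/run", data=body, method="POST")


def submit(sequence: str) -> str:
    """Send the POST and return the raw response body as text.

    Requires the service to be running; not called at import time.
    """
    with urlopen(build_submit(sequence), timeout=10) as response:
        return response.read().decode("utf-8")


# Inspect the request that would be sent, without touching the network.
request = build_submit("AAAAA")
print(request.get_method(), request.full_url, request.data.decode("ascii"))
```

Encoding the body with `urlencode` mirrors what `curl -d` sends and keeps the sequence safely escaped if it ever contains unexpected characters.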

Technologies / Architecture

Two “environments” will be used to develop, test, and deploy this API:

  • The Development Environment refers to the class ISP server. This env will be used to develop new features / routes in the source code. Containers will be built and deployed using docker and docker-compose commands. Services will be tested by directly connecting to the containers.

  • The Deployment Environment refers to the Kubernetes cluster. This env will host both a testing (also called “staging”) and production deployment of the full API. No code edits will take place in this environment. It will exclusively pull pre-built / tagged containers from Docker Hub for the runtime.

Note

Our deployment environment is a local Kubernetes cluster, but it could just as easily have been AWS, Azure, Google Cloud, etc.

Figure: Design diagram (../_images/architecture.png)

The different components of this environment will be described in the following pages.