Machine Learning in Biology — Bioinformatics

“Bioinformatics is the science of collecting and analyzing complex biological data”

Bioinformatics combines elements of biology and computer science, using computational methods to analyze biological data. Since the discovery of DNA's double-helix structure in 1953, molecular biology has seen a steep rise in the amount of data to process. Scientists have been gathering enormous volumes of genome-sequence data from many species, and nowadays data analysis in bioinformatics predominantly focuses on large data sets such as macromolecular structures and genome sequences.

Predicting the three-dimensional structure of proteins from their building blocks

A three-dimensional reconstruction of a protein. Proteins have a structure-function relationship, so understanding how their building blocks interact is both important and complex.

Comparing similar sequences between different species

A snippet of the same protein found in different species, showing the similarities between the sequences.
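As a rough illustration of what "comparing sequences" means computationally, here is a minimal sketch that computes the percent identity between two already-aligned protein snippets. The sequences below are invented placeholders, not the ones in the figure, and real comparisons usually start from an alignment produced by a tool such as BLAST or Clustal.

```python
# Minimal sketch: percent identity between two aligned protein snippets.
# The sequences are made up for illustration purposes.

def percent_identity(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions where the two sequences agree."""
    assert len(seq_a) == len(seq_b), "sequences must already be aligned"
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

species_1 = "MVLSPADKTNVKAAWGKVGA"   # hypothetical snippet
species_2 = "MVLSGEDKSNIKAAWGKIGG"   # hypothetical snippet
print(f"identity: {percent_identity(species_1, species_2):.1f}%")
```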

Pathogenomics — Fighting Viral infections

HIV develops resistance to drugs quickly and efficiently, so the drugs used for treatment become obsolete at a remarkable pace. For that reason modern medicine conceived the “cocktail”: a mix of three or more different drugs that prevents the virus from replicating and spreading. However, for this to work we have to make sure the virus is not already resistant to one of the drugs.

This genome segment was taken from a strain of virus resistant to a certain drug.
This genome segment was taken from a strain of virus not resistant to the same drug.

If this seems simple, here are a few facts to keep in mind. The HIV genome comprises nine genes packed into roughly 10,000 bases (10 kb), and if you're comparing human DNA samples, the human genome contains about 3 billion base pairs (3 Gbp).
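Conceptually, screening a sequenced strain for known resistance boils down to checking whether specific positions carry specific variant bases. Here is a hedged sketch of that idea; the positions and bases below are hypothetical placeholders, not real HIV resistance markers.

```python
# Hedged sketch: flag a genome segment that carries a known resistance
# mutation. Positions and variant bases are invented for illustration.

RESISTANCE_MUTATIONS = {103: "A", 181: "C"}  # position -> resistant base (hypothetical)

def is_resistant(genome_segment: str) -> bool:
    """Return True if any known resistance mutation is present."""
    return any(
        pos < len(genome_segment) and genome_segment[pos] == base
        for pos, base in RESISTANCE_MUTATIONS.items()
    )
```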

These are only a handful of examples showing why biologists need computer-science tools. Additionally, scientists around the world constantly gather genomic data from different organisms and upload their results to online resources such as BLAST (Basic Local Alignment Search Tool) and GO (Gene Ontology). However, the people uploading a sample don't always know exactly what it is: whether it's a gene, a regulatory sequence, or something else.

Types Of Learning

Supervised

Used when you have prior knowledge (labels) about the data; it allows us to make predictions about new samples.

The machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples.

Unsupervised

Used when you have no prior knowledge about the data; it allows us to discover new classes. I will not cover unsupervised learning in this post.

Type of machine learning algorithm used to draw inferences from data sets consisting of input data without labeled responses.

In this example, the red dots are labeled as samples taken from tumors and the yellow dots are labeled as healthy. We can therefore use supervised learning to predict whether a new sample is from a tumor or from healthy tissue: depending on where the new sample falls in this space, we predict which type it came from. If it falls on the right side it comes from a tumor sample, and if it falls on the left it is healthy. In practice, however, there are usually thousands of samples and far more than two gene-expression values per sample, making the raw data matrix high-dimensional and much more complicated.
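To make this concrete, here is a minimal supervised-learning sketch in the spirit of the two-gene example above. The expression values and labels are synthetic numbers I made up for illustration, and the choice of logistic regression is just one possible classifier.

```python
# Minimal supervised-learning sketch: two gene-expression values per
# sample, labeled tumor (1) or healthy (0). All numbers are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

# rows = samples, columns = expression of gene A and gene B
X = np.array([[5.1, 2.0], [4.8, 1.9], [1.2, 6.3],
              [0.9, 5.8], [5.5, 2.4], [1.1, 6.0]])
y = np.array([1, 1, 0, 0, 1, 0])  # 1 = tumor, 0 = healthy

clf = LogisticRegression().fit(X, y)
new_sample = np.array([[4.9, 2.1]])
print(clf.predict(new_sample))        # predicted label for the new sample
print(clf.predict_proba(new_sample))  # how confident the model is
```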

Classifying Algorithms

We have a labeled data set; what algorithms can we use? Decision trees, KNN (K-Nearest Neighbors), SVM (Support Vector Machines), NN (Neural Networks), and many more. Which algorithm you use depends on how the raw data is structured.

Decision trees: Even when your data is complex, decision trees can work well, and it's usually easier to grasp the dividing rules they produce (a small code sketch follows the example below).

Decision tree — Example from medicine
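Here is a hedged sketch of fitting a small decision tree and printing its dividing rules. The clinical features (age, blood pressure) and labels are invented for illustration and are not the tree from the figure above.

```python
# Sketch: a shallow decision tree on made-up clinical features, with
# its splitting rules printed in human-readable form.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[63, 145], [40, 120], [70, 160], [35, 110], [55, 150], [45, 118]]
y = [1, 0, 1, 0, 1, 0]  # 1 = high risk, 0 = low risk (invented labels)

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "blood_pressure"]))
```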

SVM: An SVM finds a hyperplane that divides the space in two; the space can be multidimensional, and the boundary that divides our data does not have to be linear. The algorithm can't always classify the training set perfectly, so you have to choose the best possible separation. Also, the farther a sample lies from the boundary, the more confidence we have in its classification.
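A minimal sketch of that idea, on synthetic data: the signed distance from the separating boundary (scikit-learn's decision_function) can be read as confidence in the classification.

```python
# Sketch of an SVM classifier; decision_function returns the signed
# distance from the separating boundary. Data is synthetic.
from sklearn.svm import SVC

X = [[1.0, 1.2], [0.8, 0.9], [3.1, 3.0], [3.4, 2.8], [1.1, 0.7], [2.9, 3.3]]
y = [0, 0, 1, 1, 0, 1]

svm = SVC(kernel="rbf").fit(X, y)           # the boundary need not be linear
print(svm.predict([[2.0, 2.0]]))
print(svm.decision_function([[2.0, 2.0]]))  # farther from 0 = more confident
```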

KNN: Classifies a new sample by a vote of the K training samples nearest to it. The algorithm works locally, looking only at the sample's immediate neighborhood. Its main advantage is that it lets us analyze data sets whose structure is very complex and hard to describe with an explicit model.
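A minimal KNN sketch on synthetic data, where the new sample is labeled by a majority vote of its three nearest neighbors:

```python
# Sketch of K-nearest-neighbors: the new sample gets the majority label
# of the K closest training samples. Synthetic data.
from sklearn.neighbors import KNeighborsClassifier

X = [[1.0, 1.1], [0.9, 0.8], [3.0, 3.2], [3.3, 2.9], [1.2, 1.0], [3.1, 3.0]]
y = [0, 0, 1, 1, 0, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2.8, 3.1]]))  # vote among the 3 nearest neighbors
```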

Performance Evaluation

Because different algorithms behave differently on different data sets, we have to evaluate the classifier we created. To do so, we split the data into smaller sets: we fit the classifier on one part (the “training” set) and then measure its success on data it has not seen (the “testing” set), checking its accuracy, sensitivity, and specificity.

Accuracy = (TP+TN)/(TP+TN+FP+FN), Sensitivity = TP/(TP+FN), Specificity = TN/(FP+TN). Sadly, there will always be trade-offs between how sensitive, accurate, and specific our algorithm is, so it's up to you as a scientist to decide which matters most. While training the algorithm we also need to be careful to avoid overfitting it to one training set; the way we make sure we don't overfit is by performing cross-validation.
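Here is a short sketch of computing those three metrics from a confusion matrix; the true and predicted labels below are invented for illustration.

```python
# Compute accuracy, sensitivity, and specificity from a confusion matrix.
# The label vectors are made up for illustration.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(accuracy, sensitivity, specificity)
```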

Cross-validation

In cross-validation, we create several different testing sets from our entire data set. This gives us several different training/testing splits of the same data on which to evaluate the algorithm's performance. You can test different models, summarize the overall performance (sensitivity, accuracy, and specificity) across every split, and pick the best one.
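As a final sketch, here is a minimal 5-fold cross-validation run on synthetic data, reporting the mean and spread of accuracy across the folds; the KNN model is just one possible choice.

```python
# Minimal cross-validation sketch: 5-fold CV on synthetic data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples, 5 features (synthetic)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic labels

scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```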
