Machine Learning in Biology

The typical problem in biological data analysis is a very large number of parameters combined with a limited training set. In this situation, models easily end up under-trained or over-trained. Given the complexity of the task, our company uses preliminary data processing to reduce the dimension of the parameter space. In addition, we invite experts in particular areas of biology and medicine to help distinguish the important aspects of the data from the secondary ones.
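
As a rough illustration of what reducing the dimension of the parameters can look like, the sketch below keeps only the measured parameters that vary most across samples. This is an assumption made for illustration, not a description of our actual pipeline: the function names `top_variance_features` and `project`, the toy data, and the choice of a variance filter are all hypothetical.

```python
from statistics import pvariance


def top_variance_features(rows, k):
    """Return the indices of the k columns with the largest variance, in column order."""
    n_features = len(rows[0])
    variances = [pvariance([row[j] for row in rows]) for j in range(n_features)]
    # Rank columns by variance, highest first, and keep the first k.
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])


def project(rows, columns):
    """Keep only the selected columns of every sample."""
    return [[row[j] for j in columns] for row in rows]


# Hypothetical toy data: 4 samples, 5 measured parameters.
samples = [
    [1.0, 10.0, 0.50, 3.0, 7.0],
    [1.0, 12.0, 0.51, 9.0, 7.1],
    [1.0, 8.0, 0.49, 1.0, 6.9],
    [1.0, 11.0, 0.50, 6.0, 7.0],
]
kept = top_variance_features(samples, k=2)
reduced = project(samples, kept)
```

A variance filter depends on the measurement scale and ignores the target being predicted, so in practice it would be one of several candidate steps, chosen together with a domain expert.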

Mathematics has the term “Correctly Defined Task”, more commonly known as a well-posed problem. It means that the problem has a single solution and that the solution is stable with respect to the input data: small changes in the input cause only small changes in the solution. Solving a correctly defined problem is a more or less routine process, while solving an incorrectly defined problem becomes magic with unpredictable results. Solutions found for incorrectly defined problems usually bear little relation to reality.
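
The stability requirement can be made concrete with a very small example. The sketch below solves two 2x2 linear systems and measures how far the solution moves when one input value changes by 0.001. The helper names `solve_2x2` and `sensitivity` and the specific matrices are hypothetical, chosen only to illustrate the idea.

```python
def solve_2x2(a, b, c, d, e, f):
    """Solve a*x + b*y = e and c*x + d*y = f by Cramer's rule."""
    det = a * d - b * c
    if det == 0:
        raise ValueError("no unique solution: the determinant is zero")
    return ((e * d - b * f) / det, (a * f - e * c) / det)


def sensitivity(matrix, rhs, delta):
    """Largest change in x or y when the second right-hand side shifts by delta."""
    a, b, c, d = matrix
    e, f = rhs
    x0, y0 = solve_2x2(a, b, c, d, e, f)
    x1, y1 = solve_2x2(a, b, c, d, e, f + delta)
    return max(abs(x1 - x0), abs(y1 - y0))


well_conditioned = (1.0, 0.0, 0.0, 1.0)    # identity matrix
nearly_singular = (1.0, 1.0, 1.0, 1.0001)  # the two rows are almost identical

stable_shift = sensitivity(well_conditioned, (2.0, 2.0), delta=0.001)
unstable_shift = sensitivity(nearly_singular, (2.0, 2.0), delta=0.001)
```

For the identity matrix the solution moves by about as much as the input did. For the nearly singular matrix the same input change moves the solution thousands of times further, which is the kind of instability that makes a problem incorrectly defined.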

Correctly defining a task is especially difficult for biological problems. It is not possible to be sure in advance that a solution exists at all. There may be more than one solution and, most importantly, the solution may be very sensitive to the input data.

Our specialists pay very close attention to defining the task correctly and to finding a stable solution. The theoretical basis of Machine Learning is not yet well established, which pushes us to use an iterative approach to increase the reliability of the results. We use the following techniques:

  • Going through a large number of models and their parameters to carefully select the most suitable one
  • Continuous and rigorous cross-validation of our results
  • Extensive study of the particular knowledge domain
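
The first two techniques can be sketched together: try several candidate models and keep the one with the lowest cross-validated error. The example below is a minimal sketch under stated assumptions, not our production code. It uses a one-feature ridge regression without intercept as a stand-in model, and the names `fit_ridge`, `cv_error`, `select_model`, the candidate penalties, and the toy data are all hypothetical.

```python
def fit_ridge(xs, ys, lam):
    """One-feature ridge regression without intercept: w = sum(x*y) / (sum(x*x) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def cv_error(xs, ys, lam, n_folds):
    """Mean squared error over all samples, each predicted while held out of training."""
    n = len(xs)
    total = 0.0
    for fold in range(n_folds):
        test_idx = set(range(fold, n, n_folds))  # every n_folds-th sample
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        w = fit_ridge(train_x, train_y, lam)
        total += sum((ys[i] - w * xs[i]) ** 2 for i in test_idx)
    return total / n


def select_model(xs, ys, candidates, n_folds=3):
    """Return the candidate penalty with the lowest cross-validated error."""
    return min(candidates, key=lambda lam: cv_error(xs, ys, lam, n_folds))


# Hypothetical toy data, roughly y = 2x with some noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
best_lam = select_model(xs, ys, candidates=[0.0, 0.1, 1.0, 10.0, 100.0])
```

Each sample is predicted only by a model that never saw it during fitting, so the error reflects how the model behaves on new data rather than how well it memorised the training set.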

We put a lot of effort into making sure that our predictions are correct and can be experimentally validated. For the most difficult tasks, it takes several lab validation cycles to achieve highly reliable results.
