Machine learning in drug discovery
Artificial intelligence is advancing various industries, including healthcare and the pharmaceutical industry. According to Accenture data, key clinical health AI applications can potentially create $150 billion in annual savings for the United States healthcare sector by 2026.
The numbers show that the healthcare industry will heavily leverage the possibilities provided by machine learning. That’s why AI companies are getting involved in various activities in the treatment process, from diagnosis to therapy and drug development.
By applying convolutional neural networks to detect diabetic retinopathy, deepsense.ai significantly improved the diagnostic process by speeding up and automating diabetic retinopathy screenings. The next step may be building a reinforcement learning agent that can be trained to run by controlling the muscles attached to a virtual skeleton. With such a model, doctors could predict whether a patient will be able to walk, jump or run properly after treatment. Furthermore, the work done during the research might later be used to design new, AI-powered leg prostheses.
Another healthcare segment that is heavily dependent on data is drug discovery.
The potential of AI in drug discovery
Computational solutions in drug discovery help significantly reduce the cost of introducing drugs to the market. A 2018 report by Grand View Research estimates that the global drug discovery informatics market was worth $713.4 million in 2016 and anticipates that it will grow at a compound annual growth rate (CAGR) of 12.6% through 2025. With artificial intelligence being used in drug discovery, the market’s value is growing rapidly. In its Global Artificial Intelligence in Drug Discovery Market Size Analysis, 2018-2028, Bekryl indicates that AI has the potential to create $70 billion in savings in the drug discovery process by 2028.
The technological and paradigm shift to machine learning seen in the pharmaceutical industry enables researchers to use novel computational algorithms to support the process. Biomedical data are highly complex, but with these algorithms, designing new drugs computationally is more feasible than it has ever been. Machine learning can enhance many stages of the drug discovery process:
- preliminary but crucial stages including designing a drug’s chemical structure.
- investigating the effect of a drug – both in basic preclinical research and clinical trials, in which a lot of biomedical data is produced. Finding new patterns in those data can be facilitated by machine learning.
There are different kinds of data, including genetic and imaging ones. Each of them can be analyzed with machine learning and further used to build novel solutions for drug discovery.
Challenges in machine learning for drug discovery
Ensuring drug safety is one of the main challenges in the drug discovery process. Interpreting information about the known effects of drugs and predicting their side effects are complex tasks. Scientists and engineers from research institutions and pharmaceutical companies like Roche and Pfizer have been trying to use machine learning to extract meaningful information from data obtained in clinical trials. Interpreting this data in the context of drug safety is an active area of research.
Clinical trials are the most expensive stage of drug development. To reduce their costs, it is crucial to use the experience gained during previous clinical trials in the early stages of drug development. This can be achieved in two complementary ways:
- biomedical data from research experiments could be analyzed and interpreted using machine learning to predict a drug’s effects and side effects;
- data from clinical trials analyzed with machine learning should support the interpretation of biological data.
With those two approaches developed simultaneously, it is possible to design better preclinical experiments to come up with the most effective therapies with the fewest side effects.
Integrating biomedical data with computational approaches
Machine learning could help optimize therapy by integrating biomedical and clinical data with computational models, and can be used to build software to test drugs and combinatorial therapies. Some computational models and approaches which support the integration of clinical data are still under development but there are also a few very good examples of successful data integration in biology and medicine.
For example, there are a number of machine learning methods for integrating genetic regulatory networks and pathway information, which can be used to predict the biological functions of genes. There are also efficient Python-based implementations of bioinformatic tools and approaches that are easy to interface with widely used machine learning packages.
Genetic data analysis and personalized medicine
Many pharmaceutical companies and startups are focused on genetic data interpretation and personalized medicine. Understanding a patient’s genetic profile helps in choosing appropriate drugs and therapy. Machine learning could advance the computational approaches used to analyze genetic data and propose novel therapies. So far, only a few machine learning solutions have had an impact on clinical practice, but they show huge potential for personalized medicine and drug discovery. They include the discovery of novel biomarkers of drug response and machine learning-based computational tools used in clinical practice. Such tools are used to estimate resistance to individual drugs and to combinatorial therapies based on genotype analysis.
One possible approach is to interpret the genetic code as a one-dimensional image and then apply a standard machine learning algorithm. The data is then scoured for patterns and anomalies, just as has been done in various other deepsense.ai image recognition projects. Genomic data may in fact be analyzed in much the same way as classical paintings are when the goal is to find a hand or any other element. For the algorithm, the nature or shape of the image is largely irrelevant, so the machine can analyze a one-dimensional DNA chain much as it would any other type of image data.
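The idea of treating a DNA sequence as a one-dimensional image can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the method used in any particular project: the function names, the example sequence and the hand-written "TATA" filter are all assumptions made for the example. In a real convolutional network, the filter weights would be learned from data rather than written by hand.

```python
# One-hot encoding turns a DNA string into a 4-channel, one-dimensional "image".
BASES = "ACGT"

def one_hot(sequence):
    """Return one row of 4 numbers per base; unknown bases (e.g. N) become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence.upper()]

def conv1d(encoded, kernel):
    """Slide a (width x 4) kernel along the sequence and return one score per position."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        score = sum(
            window[i][c] * kernel[i][c]
            for i in range(width)
            for c in range(len(BASES))
        )
        scores.append(score)
    return scores

# Hypothetical hand-written filter that responds most strongly to the motif "TATA".
# A trained network would learn such weights instead.
tata_kernel = one_hot("TATA")

sequence = "GGCTATAAGC"  # made-up example sequence
scores = conv1d(one_hot(sequence), tata_kernel)
best_position = scores.index(max(scores))  # position where the filter responds most
```

The filter plays the same role as an edge or shape detector in image recognition: it produces a high score wherever the local pattern in the sequence matches its weights.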
Because genomic data is usually presented as a string of letters, it is also possible to apply Natural Language Processing techniques. One advantage of doing so is that it widens the context the algorithm is able to take into account. That may be important when particular changes or patterns are being sought, or when the pattern to find consists of a longer sequence of genes.
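One common way to make a DNA string look like text is to split it into overlapping k-letter "words" (k-mers) and count them, much like a bag-of-words model. The sketch below assumes this k-mer approach purely for illustration; the function names and the example sequence are hypothetical, and the text above does not prescribe a specific NLP technique.

```python
from collections import Counter

def kmer_tokens(sequence, k=3):
    """Split a DNA string into overlapping k-letter 'words' (k-mers)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def bag_of_kmers(sequence, k=3):
    """Count how often each k-mer occurs, analogous to a bag-of-words vector."""
    return Counter(kmer_tokens(sequence, k))

example = "ATGCGATG"  # made-up example sequence
tokens = kmer_tokens(example, k=3)
counts = bag_of_kmers(example, k=3)
```

Once the sequence is tokenized this way, the resulting "words" can be fed to standard text models. Increasing k, or using models that look at many tokens at once, lets the algorithm take longer patterns into account.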
Fully unlocking the potential of machine learning for drug discovery and personalized medicine remains a big challenge. Time series data could be useful for fully reconstructing genetic networks on the basis of expression data. To build comprehensive predictive models based on machine learning, gene expression data and sequencing data should be acquired as time series.
Innovative startups, like Cambridge Cancer Genomics, use machine learning to analyze data gained from liquid biopsy, a diagnostic technology in which circulating tumor cells or cell-free DNA is collected from blood samples. Although it is not a fully standardized approach for cancer therapy monitoring, it is highly anticipated in personalized medicine due to its ability to acquire genetic data in time series during treatment. Applying machine learning to better understand those data and to answer the question of why cancer evolves could help scientists design less toxic therapies.
Building and getting insight from databases and datasets
Scientists use public repositories of clinical data to extract medical knowledge, tackle major clinical problems and help medical doctors in their everyday work. These repositories could also be used for drug discovery, making it possible to include clinical information in the early stages of drug development.
Attempts have been made to represent medical knowledge using deep neural networks. Data mapped with machine learning might also be easier to integrate with biomedical data analyzed with machine learning, thanks to better compatibility in the data structures generated with similar approaches.
New achievements in building databases for machine learning purposes are also promising. For example, the authors of the paper “Integrative analysis and machine learning on cancer genomics data using the Cancer Systems Biology Database (CancerSysDB)” developed a database for highly flexible queries and analysis of cancer-related data across multiple data types and multiple studies. However, many problems in medicine and drug discovery are very difficult to solve on the basis of public data alone, and there is too little such data to develop better machine learning models and approaches.
Building proper datasets to answer specific scientific questions requires understanding not only how data is preprocessed, but also how different bioinformatic tools work, along with interdisciplinary knowledge of biomedicine and of the areas where computer science and medicine converge. Teams with this knowledge and these skills can make better use of even limited amounts of data from public repositories. Machine learning engineers usually get their data from scientists, medical doctors, pharmaceutical companies and hospitals, so the amount available is limited. Models must therefore be robust enough to deliver results with little data.
One of the best examples of designing a model of superior strength, one that can deal with the lack of proper data, was deepsense.ai’s Right Whale Recognition engine. The model was designed to recognize an individual Right Whale in a photograph, even if there were only a few photos provided in the dataset.
To get deep insight from data, close cooperation and mutual understanding of different languages and disciplines is needed. That is difficult if there is only the occasional consultation.
Scientists with a dual biomedical and computational background are crucial members of teams building databases, datasets, machine learning models, tools and software for analyzing biomedical data and drug discovery. Leading institutions like ETH Zurich have already started educating a new generation of medical scientists with a computational and math background and have built a platform and interdisciplinary teams to analyze clinical and biomedical data. The Swiss tumor board and ETH Personalized Health Technologies Platform Nexus are actively working towards implementing individualized, biomarker-based medical decisions in clinical practice. These are crucial fundamental steps in advancing drug discovery with machine learning.
Standard machine learning approaches for genetics and genomics
Standard supervised, semi-supervised and unsupervised machine learning algorithms are applied to analyze genetic data such as microarray or RNA-seq expression data. To understand how, read “machine learning in genetics and genomics”. These algorithms can distinguish diseased from healthy phenotypes and could be further used to uncover the mechanisms of action of drugs. In any application of machine learning methods, the researcher must decide which data to provide as input to the algorithm in order to answer complex biomedical questions.
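As a rough illustration of the supervised case, the sketch below trains a nearest-centroid classifier on gene expression profiles and uses it to label a new sample. The classifier choice, the function names and the expression values are all assumptions made for this example. The numbers are invented, and real microarray or RNA-seq data would involve thousands of genes and require normalization first.

```python
import math

def centroid(profiles):
    """Average expression level of each gene across a group of samples."""
    n = len(profiles)
    return [sum(values) / n for values in zip(*profiles)]

def fit_nearest_centroid(profiles, labels):
    """Compute one centroid per phenotype label."""
    grouped = {}
    for profile, label in zip(profiles, labels):
        grouped.setdefault(label, []).append(profile)
    return {label: centroid(group) for label, group in grouped.items()}

def predict(model, profile):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], profile))

# Made-up expression levels for three hypothetical genes, one row per sample.
train_profiles = [
    [8.1, 2.0, 5.2],
    [7.9, 2.3, 5.0],
    [3.0, 6.8, 5.1],
    [2.7, 7.1, 4.9],
]
train_labels = ["healthy", "healthy", "disease", "disease"]

model = fit_nearest_centroid(train_profiles, train_labels)
prediction = predict(model, [3.2, 6.5, 5.0])  # new, unlabeled sample
```

The choice of input matters as much as the algorithm: here the third gene barely differs between the groups, so it contributes almost nothing to the decision.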
There are a number of comprehensive reviews summarizing the use of large-scale analysis of genomic data and machine learning strategies to solve genomic sequencing problems, like finding specific regions in sequences and recognizing locations of transcriptomic sites. It is one of the biggest challenges in genomics with practical applications.
Machine learning has potential for this application, though the results produced with machine learning algorithms should be validated with data from laboratory experiments or clinical trials. Deep learning algorithms could be useful in genome interpretation and analysis of genetic variants, a complex task that requires a combination of robust biological data and clinical knowledge.
Recently scientists and engineers have taken a step toward better understanding the human genome thanks to machine learning. Supervised heterogeneous ensemble methods can significantly improve our ability to address difficult biomedical prediction problems. Still, the application of machine learning algorithms to genomic problems is in a nascent stage. After all, genomic and genetic data are multidimensional and there remains a need to develop probabilistic machine learning algorithms for their analysis.
Machine learning approaches for network analysis of biomedical data
Analysis of genetic data could be helpful in elucidating genetic networks, which can reveal a drug’s mechanism of action and help understand how diseases work. This falls within the scope of an emerging new discipline called network medicine. The Barabasi group, a pioneer in network medicine, states that an unsupervised network-based approach enables the prediction of novel drug-disease associations, which offer significant opportunities for finding new applications for drugs and predicting potential side effects.
The group also found that the therapeutic effect of drugs might be localized in a small network neighborhood. This means that several genes in close network proximity of genes related to the mechanism of a disease could be targeted to effectively treat the disease.
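One simple way to quantify "close network proximity" is to measure, for each drug target, the shortest-path distance to the nearest disease gene and then average those distances. The sketch below assumes this definition for illustration only and is not necessarily the exact measure used by the group. The toy network and gene names (G1 to G6) are hypothetical.

```python
from collections import deque

def shortest_path_lengths(graph, source):
    """Breadth-first search: number of edges from source to every reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances

def closest_proximity(graph, drug_targets, disease_genes):
    """Average, over drug targets, of the distance to the nearest disease gene."""
    total = 0
    for target in drug_targets:
        distances = shortest_path_lengths(graph, target)
        reachable = [distances[g] for g in disease_genes if g in distances]
        if not reachable:
            return float("inf")  # this target cannot reach any disease gene
        total += min(reachable)
    return total / len(drug_targets)

# Toy undirected gene network with hypothetical gene names.
edges = [("G1", "G2"), ("G2", "G3"), ("G3", "G4"), ("G4", "G5"), ("G2", "G6")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

disease_genes = {"G3", "G4"}
near = closest_proximity(graph, ["G2", "G5"], disease_genes)  # targets next to disease genes
far = closest_proximity(graph, ["G1", "G6"], disease_genes)   # targets farther away
```

Under this measure, a drug whose targets score lower sits closer to the disease neighborhood in the network. In practice, such scores are usually compared against randomly chosen gene sets to judge whether the proximity is meaningful.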
Analyzing genetic network data with machine learning could help find novel drug targets and predict the optimal combination of drugs. There are research papers that explain how to benchmark machine learning for biological network analysis. One is “machine learning-assisted network inference approach to identify a new class of genes that coordinate the functionality of cancer networks.” The study uses support vector machine (SVM) models combined with machine learning-assisted network inference (MALANI) to identify cancer-associated gene pairs. These pairs can be used to reconstruct cancer networks and to identify key cancer genes in high-dimensional data space that would otherwise go undetected by conventional approaches. The approach should work equally well with other machine learning and feature selection methods. There is also a tutorial by Stanford lecturers which shows the basics of using deep learning approaches to analyze biological networks. However, non-standard machine learning algorithms for the analysis of complex biological networks are still being developed, and network and machine learning approaches need better integration.
Machine learning algorithms in image analysis for drug discovery
The article “Machine learning and image-based profiling in drug discovery” describes how image-based screening in high-throughput experiments, in which cells are treated with drugs, could help elucidate a drug’s mechanism of action. The authors note that unsupervised and simple statistical inference methods tend to be favored for analyzing image data from large-scale profiling experiments, but that complex biological phenotypes and single-cell experiments can be successfully classified with supervised algorithms.
A recently explored application of supervised learning in image-based profiling, particularly with deep neural networks, is novelty detection, that is, identifying unexpected phenotypes revealed in the drug discovery process. With deep learning it is also possible to predict the properties of a molecule from its structure alone. The technique uses a convolutional neural network that extracts the shape of a molecule and then relates it to the information gathered about its properties.
Novel machine learning algorithms under way
Research on quantum machine learning suggests that this approach could be useful for finding complex patterns in data. As biological and medical data are complex, probabilistic quantum machine learning algorithms represent a real opportunity to understand them better. Innovative pharmaceutical companies like Amgen and startups like ProteinQure have moved to apply quantum computing and quantum machine learning to drug discovery, focusing these efforts mainly on predicting the structure of new drugs. Finally, genomics and systems biology are two important areas in which novel machine learning algorithms can be applied with a view to producing less toxic drugs based on in-depth analysis of biomedical data.
The text was written in collaboration with Anna Kornakiewicz, an independent data scientist and researcher, who served as a consultant.