Case study

Using Contextual Causal Data To Explore Lung Cancer - Part 2

Introduction / Situation

The lung cancer knowledge graph generated by Galactic AI™ provides an accurate, up-to-date and comprehensive overview of the existing data on lung cancer. The graph consists of 71K distinct directed molecular interactions and 7.8K distinct proteins, all documented in the context of lung cancer research. Each protein in the graph represents a potential disease-specific target; however, uncovering the targets with the most promise remains a difficult challenge. In this analysis we demonstrate the value of ranking targets as key points of intervention using different statistical methods, and we measure these rankings against known lung cancer targets.


The first step in identifying potential novel lung cancer targets was to overlay existing lung cancer drug targets from ChEMBL, a database of bioactive molecules with drug-like properties. ChEMBL includes 2,960 distinct drugs and 1,369 indications; 191 single-protein targets are approved for use across lung cancer indications. Of these existing lung cancer targets, 175 were found within the lung cancer knowledge graph generated by Galactic AI™.

With the knowledge of existing lung cancer targets and their positions within the knowledge graph, we then sought to test various ways of scoring and ranking proteins by their likelihood of representing a lung cancer drug target. In each testing method, or scorer type, a positive prediction was a known lung cancer target. Each scorer was converted to a classifier by choosing a threshold for positive predictions. The target proteins were scored and ranked through a series of different methodologies (described below), initially independently and then in combination.
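As a minimal sketch of the thresholding step, the snippet below turns a set of scores into positive and negative predictions and compares them with a set of known targets. The protein names, scores and threshold are hypothetical placeholders chosen for illustration; they are not values from the analysis.

```python
def classify(scores, threshold):
    """Mark a protein as a positive prediction when its score meets the threshold."""
    return {protein: score >= threshold for protein, score in scores.items()}


def precision_recall(predictions, known_targets):
    """Compare thresholded predictions with the set of known targets."""
    tp = sum(1 for p, positive in predictions.items() if positive and p in known_targets)
    fp = sum(1 for p, positive in predictions.items() if positive and p not in known_targets)
    fn = sum(1 for p, positive in predictions.items() if not positive and p in known_targets)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical scores and known targets, for illustration only.
scores = {
    "protein_a": 0.92,
    "protein_b": 0.81,
    "protein_c": 0.40,
    "protein_d": 0.77,
    "protein_e": 0.85,
}
known_targets = {"protein_a", "protein_b", "protein_d"}

predictions = classify(scores, threshold=0.8)
precision, recall = precision_recall(predictions, known_targets)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold generally trades recall for precision, which is why threshold-free summaries such as average precision and AUROC are useful for comparing scorers.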

Centrality Measures – We investigated an array of out-of-the-box graph theory measures, many of which reached 14% average precision.
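To illustrate one such measure, here is a small pure-Python sketch of eigenvector centrality on a directed graph, computed by power iteration. It is not the implementation used in the analysis; the edge list is a hypothetical toy graph, and graph libraries such as NetworkX provide equivalent routines.

```python
import math


def eigenvector_centrality(edges, iterations=200, tol=1e-10):
    """Eigenvector centrality over incoming edges, via shifted power iteration."""
    nodes = sorted({node for edge in edges for node in edge})
    score = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Start from the previous vector (iterating on A + I) so that
        # periodic graphs still converge.
        new = dict(score)
        for source, target in edges:
            new[target] += score[source]
        norm = math.sqrt(sum(value * value for value in new.values()))
        new = {node: value / norm for node, value in new.items()}
        if sum(abs(new[node] - score[node]) for node in nodes) < tol:
            return new
        score = new
    return score


# Hypothetical directed interactions (source regulates target).
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]
centrality = eigenvector_centrality(edges)
ranking = sorted(centrality, key=centrality.get, reverse=True)
print(ranking)
```

In this toy graph, node C receives edges from two nodes that are themselves well connected, so it ranks first.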

Enrichment – The enrichment of proteins present within the lung cancer graph is calculated by comparing against the full causal database using the hypergeometric distribution. Targets that are more heavily regulated show greater average precision.
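A hypergeometric enrichment test can be sketched as follows. The exact formulation used in the analysis is not given, so this example assumes one plausible reading: the full causal database is the population, a protein's interactions in that database are the "successes", and the lung cancer graph is the sample drawn from it. All counts below are hypothetical.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws)."""
    lower = max(k, 0, draws - (population - successes))
    upper = min(successes, draws)
    log_total = log_comb(population, draws)
    return sum(
        math.exp(
            log_comb(successes, i)
            + log_comb(population - successes, draws - i)
            - log_total
        )
        for i in range(lower, upper + 1)
    )


# Hypothetical counts, for illustration only.
total_interactions = 10_000      # interactions in the full causal database
protein_interactions = 50        # of those, interactions involving the protein
graph_interactions = 500         # interactions in the disease-specific graph
protein_in_graph = 12            # of those, interactions involving the protein

p_value = hypergeom_sf(
    protein_in_graph, total_interactions, protein_interactions, graph_interactions
)
enrichment_score = -math.log10(p_value)
print(f"p={p_value:.3g} score={enrichment_score:.2f}")
```

A smaller p-value means the protein appears in the disease graph more often than its overall frequency in the database would suggest, so the negative log of the p-value can serve as a ranking score.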

Max relevance – Max relevance incorporates the contextual strength of lung cancer against the causal data. On its own, max relevance shows marginal performance in predicting lung cancer targets.

Max confidence – Max confidence reflects how confident the software is that it has accurately curated what was documented in the research. As with relevance, confidence is only marginally effective as a predictor when used in isolation.

Individually, each scorer is valuable in ranking the target proteins. However, the most accurate scorer combined eigenvector centrality, enrichment, max relevance and max confidence, demonstrating 18% average precision and 81% AUROC.

It is important to consider that false positive predictions are targets that are not currently indicated for lung cancer and may represent novel opportunities. These novel opportunities may include targets for therapies used in non-cancer indications. The next case study will explore the results in more detail and their potential biomedical implications.
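For readers less familiar with the two metrics quoted above, the sketch below shows how average precision and AUROC can be computed from a scored list of proteins and a set of known targets. The scores and labels are hypothetical, and the functions are simple reference implementations rather than the code used in the analysis.

```python
def average_precision(scores, positives):
    """Mean of the precision values at the rank of each known target."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits = 0
    total = 0.0
    for rank, protein in enumerate(ranked, start=1):
        if protein in positives:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def auroc(scores, positives):
    """Probability that a random known target outscores a random non-target."""
    pos = [s for protein, s in scores.items() if protein in positives]
    neg = [s for protein, s in scores.items() if protein not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Ties count as half a win.
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores and known targets, for illustration only.
scores = {"p1": 0.9, "p2": 0.8, "p3": 0.7, "p4": 0.6, "p5": 0.5}
known_targets = {"p1", "p3"}

ap = average_precision(scores, known_targets)
auc = auroc(scores, known_targets)
print(f"average precision={ap:.3f} AUROC={auc:.3f}")
```

As a rough point of comparison, a random ranking has an expected average precision close to the fraction of positives in the list, which for 175 known targets among roughly 7.8K proteins is about 2%.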

Impact & Benefit


The lung cancer knowledge graph automatically generated by Galactic AI™ incorporates over 91% (175 of 191) of the approved lung cancer targets in ChEMBL, without input from any existing structured database.

Enhanced predictive power

By combining several different scoring methodologies, Galactic AI™ can provide robust predictions and improve average precision for target prediction by 22%.


The classifiers and methodologies used to rank targets are derived from contextual causal data that is disease agnostic, so they can easily be adapted to other therapy areas without manual intervention.
