With the advancement of next-generation sequencing techniques, the discovery of novel proteins has far outpaced the capacity and resources available to characterise them experimentally and functionally. Nevertheless, understanding protein function is crucial for identifying potential new drug targets and for unravelling the links between proteins and pathogenic processes.
Protein 3D structures can provide important insights into this task, but their availability has been limited to proteins with regular folds. Advances in protein modelling methods, such as AlphaFold and RoseTTAFold, have made it possible to structurally explore the human proteome at great resolution. Despite significant efforts in protein function prediction, predictive tools for protein characterisation have primarily relied on sequence information, with variable success. In addition, tools that do use protein structures are commonly limited in the number of annotation types they can handle.
In this work, we propose LEGO-CSM, a comprehensive web-based resource that addresses this gap. It uses well-established, robust graph-based signatures to train supervised machine learning (ML) models on both protein sequence and structure information, capturing features relevant to protein functional characterisation. LEGO-CSM's ML models predict three types of functional annotation: subcellular localisation, Enzyme Commission (EC) numbers, and Gene Ontology (GO) terms.
We demonstrate that our models perform as well as or better than alternative approaches, achieving an Area Under the Receiver Operating Characteristic Curve (ROC AUC) of up to 0.93 for subcellular localisation, up to 0.93 for EC numbers, and up to 0.81 for GO terms on independent blind tests. The LEGO-CSM web server is freely available at https://biosig.lab.uq.edu.au/lego_csm.
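
For readers less familiar with the reported metric, ROC AUC can be read as the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one. The following is a minimal illustrative sketch of that definition in Python; it is not LEGO-CSM's evaluation code, and the function name `roc_auc` and the toy labels and scores are assumptions made purely for illustration.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    Illustrative helper (hypothetical, not part of LEGO-CSM).
    labels: iterable of 0/1 class labels.
    scores: iterable of predicted scores, higher meaning "more positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]
auc = roc_auc(labels, scores)
print(f"ROC AUC = {auc:.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. For multi-class or multi-label annotations such as those described above, a per-class score of this kind would typically be averaged across classes, although the exact averaging scheme is not stated here.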