A3D-MODB Database | Solubility and Aggregation Properties
The A3D database provides the analysis of solubility and aggregation properties for protein structures from the AlphaFold Database.

About

Protein aggregation is linked to severe human diseases and complicates the production of therapeutic proteins. The number of available protein structures has grown sharply with the AlphaFold Database (AF), driven by advances in deep learning. By applying A3D analysis to 12 species, we extend the perspective beyond human-centric approaches. The database provides insight into protein aggregation properties and supports the rational improvement of protein solubility and design across organisms.

Aggrescan3D (A3D) is a structure-based algorithm that uses three-dimensional protein coordinates to project each amino acid's intrinsic aggregation propensity value onto the structure. The aggregation score of each residue is then corrected as a function of its solvent exposure and of the aggregation propensity of neighboring residues within a sphere of 10 Å radius. Because the score depends on the spatial position of each atom of the protein, both the atomic resolution and the biological relevance of the input structures affect the accuracy of A3D predictions.
Therefore, to interpret the aggregation tendencies in the A3D database correctly, users should consider two critical characteristics of the AlphaFold predictions:
• Structure confidence and disorder
• Quaternary structure context
(see details in the next tab: Source of the structural data: AlphaFold predictions)
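
The neighborhood step described above can be illustrated with a minimal Python sketch. It assumes one representative coordinate per residue and only shows how residues inside a 10 Å sphere might be collected. The function names and the averaging in `toy_corrected_score` are hypothetical placeholders and do not reproduce the published A3D scoring function.

```python
import math

def neighbors_within(coords, center_index, radius=10.0):
    """Indices of residues whose representative point lies within
    `radius` angstroms of the residue at `center_index`."""
    center = coords[center_index]
    return [
        i for i, point in enumerate(coords)
        if i != center_index and math.dist(center, point) <= radius
    ]

def toy_corrected_score(intrinsic, exposure, coords, index, radius=10.0):
    """Illustrative placeholder only (NOT the A3D formula): mean of the
    exposure-weighted intrinsic values of the residue and its neighbors."""
    members = [index] + neighbors_within(coords, index, radius)
    weighted = [exposure[i] * intrinsic[i] for i in members]
    return sum(weighted) / len(weighted)

# Three residues on a line: the first two are 5 Å apart, the third is far away.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
intrinsic = [1.0, -1.0, 2.0]   # hypothetical intrinsic propensities
exposure = [1.0, 0.5, 1.0]     # hypothetical relative solvent exposure (0 to 1)
print(neighbors_within(coords, 0))
print(toy_corrected_score(intrinsic, exposure, coords, 0))
```

The sketch makes the dependence on atomic positions explicit: moving a residue by a few ångströms can change which neighbors fall inside the sphere, and therefore the corrected score.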

Structure confidence and disorder

The AF-predicted structures for the 12 proteomes included in this database are complete protein chains in their monomeric state. Each amino acid in an AF model has a confidence score (pLDDT) that ranges from 0 to 100, indicating which regions could be considered equivalent to an experimentally determined structure. AF-predicted structures include regions with low pLDDT scores (PMID), which generally correspond to the highly flexible parts of the structure. These permanently or transiently disordered segments are often not defined in experimental structures, so A3D was previously unaware of them; they are now incorporated in the A3D Database. A3D predictions are blind to the dynamic nature of these regions, since AF models are static frames of their conformational ensemble. Therefore, the presence of disordered, low-confidence regions might artificially increase the number of S-APRs* in adjacent high-confidence protein domains.
* (Surface-exposed Aggregation Prone Regions)

To deal with this issue, the A3D dataset includes by default two additional precalculated structural aggregation profiles, which correspond to the predicted structures restricted to residues with pLDDT >70% and >50% confidence. These thresholds were defined after manually curating 100 randomly selected AF models.
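
As a rough illustration of such thresholding, the Python sketch below removes low-confidence atoms from an AlphaFold model in PDB format. It relies on the fact that AlphaFold PDB files store the per-residue pLDDT in the B-factor field (columns 61-66 of ATOM records). The function name `filter_by_plddt` is hypothetical, and this is not necessarily how the database's own profiles were generated.

```python
def filter_by_plddt(pdb_text, threshold):
    """Return PDB text keeping only ATOM/HETATM records whose B-factor
    field (pLDDT in AlphaFold models) is strictly above `threshold`.
    All other records (headers, TER, END, ...) are kept unchanged."""
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            # Fixed-width PDB format: columns 61-66 hold the temperature
            # factor, which AlphaFold uses to store pLDDT.
            plddt = float(line[60:66])
            if plddt > threshold:
                kept.append(line)
        else:
            kept.append(line)
    return "\n".join(kept)
```

Calling the function with thresholds of 70 and 50 on the same model text gives two trimmed structures that can be compared with the full-length model.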

In the following points, we outline different situations users may face:
• Proteins with high confidence values: the overall AF predicted structure possesses high pLDDTs. All three computed structures converge into a unique aggregation prediction. Well-defined S-APRs can be delineated from A3D output structures. (representative examples: Q68D91, S4R460)
• Proteins with limited low confidence regions: proteins that display local surface modifications with confidence thresholds of >70% or >50% pLDDT. An increase or decrease of localized S-APRs can be observed.
(representative examples: P09382, Q9UKL6, Q5EE01)
• Multidomain proteins: proteins with two or more globular domains that correspond to well-defined, confident regions, while the tethering elements that connect the globular domains have low pLDDT values. The analysis with restricted pLDDT thresholds often results in disconnected domains and, as a consequence, some previously sheltered APRs might become exposed. As a result, S-APRs might diverge among the three models, and their boundaries are less evident.
(representative examples: Q96MN5, Q96DA0)
• Proteins with extended low-confidence regions: proteins with large low-confidence regions localized in flexible C- or N-termini or in long loops, usually displayed around a well-defined globular core. With the 70% and 50% pLDDT cutoffs, disordered regions are often excluded from the model and are therefore not considered in the A3D prediction. As a result, the A3D predictions correspond to well-defined globular domains. Note that in some cases free amino acids, unattached to the protein main chain, may be present because they lie at the exclusion limit between two pLDDT thresholds.
(representative examples: Q9NRE2, Q8NFW1)
• Proteins with overall low confidence: usually short polypeptides that are mostly disordered and expose most of their surface to solvent. In some cases, the entire structure is lost in the predictions corresponding to the >70% or >50% pLDDT threshold.
(representative examples: Q8N9P0, Q8TEV8, Q9GZY1)
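
The situations listed above can be screened for with a short Python sketch. It assumes the per-residue pLDDT values are already available as a mapping from residue number to score; all function names are hypothetical. The sketch reports how much of a chain survives a threshold and which retained residues are left without a sequence neighbor, which corresponds to the free amino acids mentioned above.

```python
def residues_above(plddt_by_residue, threshold):
    """Sorted residue numbers whose pLDDT is strictly above `threshold`."""
    return sorted(r for r, score in plddt_by_residue.items() if score > threshold)

def isolated_residues(kept):
    """Retained residues whose sequence neighbors (i-1 and i+1) were both removed."""
    kept_set = set(kept)
    return [r for r in sorted(kept_set)
            if r - 1 not in kept_set and r + 1 not in kept_set]

def retained_fraction(plddt_by_residue, threshold):
    """Fraction of the chain that survives the threshold (0.0 if nothing is left)."""
    if not plddt_by_residue:
        return 0.0
    return len(residues_above(plddt_by_residue, threshold)) / len(plddt_by_residue)

# Hypothetical eight-residue chain with a low-confidence stretch in the middle.
plddt = {1: 90, 2: 85, 3: 40, 4: 72, 5: 45, 6: 30, 7: 88, 8: 91}
kept = residues_above(plddt, 70)
print(kept, isolated_residues(kept), retained_fraction(plddt, 70))
```

A retained fraction of 0.0 corresponds to the case in which the whole structure is lost at a given threshold.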

Quaternary structure context

In the AF database, structure predictions are restricted to single chains, which precludes the analysis of the quaternary structure context in the A3D Database. The physicochemical properties governing native protein-protein interactions overlap with those of the non-native contacts that trigger aberrant self-assembly, which implies that protein interfaces are enriched in S-APRs. For proteins with quaternary structure, this results in an over-prediction of S-APRs in the monomers relative to the same subunits in their native oligomeric state.

Gathering important information

We strongly encourage users to compile as much information as possible about the protein under study in order to make the most of A3D predictions.
Some helpful questions to shape the output would be:
• (i) Does the protein possess disordered regions?
• (ii) Do these disordered segments correlate with low pLDDT?
• (iii) Does the protein bind another protein? Does it form a homo- or hetero-oligomer?
• (iv) Does the protein contain a signal peptide?
• (v) Is the protein in its mature form, or is it a proprotein?

In addition to the advice above, we recommend reviewing the dedicated literature on how to use the A3D algorithm and how to evaluate its aggregation predictions:
1. AGGRESCAN3D: Toward the Prediction of the Aggregation Propensities of Protein Structures, Methods Mol Biol. 2018, 1762:427-443.
2. Aggrescan3D (A3D) 2.0: prediction and engineering of protein solubility, Nucleic Acids Res. 2019, 47(W1):W300-W307.
3. AGGRESCAN3D (A3D): server for prediction of aggregation properties of protein structures, Nucleic Acids Res. 2015, 43(W1):W306-13.
4. Computational prediction of protein aggregation: Advances in proteomics, conformation-specific algorithms and biotechnological applications, Comput Struct Biotechnol J. 2020, 18:1403-1413.

The database offers a comprehensive view of aggregation properties, which are closely linked to protein function. These aggregation propensities were computed with the A3D algorithm on protein structures predicted by AlphaFold2 for 12 diverse organisms. Signal peptides and membrane proteins are annotated from UniProt, complemented by TOPCONS-derived membrane predictions. With over 140,000 unique UniProt entries covered, the database is a resource for studying the interplay between protein structure, membrane interactions, and aggregation tendencies across a wide range of biological systems.

Common Name | Species | Total Entries | Membrane | Signal Peptide
Fruit Fly | Drosophila melanogaster | 13459 | 2681 | 3084
Mgen | Mycoplasma genitalium | 483 | 93 | 23
E. coli | Escherichia coli | 4363 | 4589954
C. elegans | Caenorhabditis elegans | 19694 | 5643 | 4091
S. pombe | Schizosaccharomyces pombe | 5128 | 897 | 213
Baker's Yeast | Saccharomyces cerevisiae | 6039 | 1222 | 101
Rat | Rattus norvegicus | 21270 | 4491 | 2570
Mouse | Mus musculus | 21615 | 4437 | 3092
Human | Homo sapiens | 23391 | 5934 | 4360
Zebrafish | Danio rerio | 24664 | 3375 | 2009
Thale Cress | Arabidopsis thaliana | 27434 | 3573 | 2473
SARS-CoV-2 | SARS coronavirus 2 | 95 | 10 | 3

A3D has been shown to accurately predict changes in the solubility of globular proteins upon mutation (PMID). Using the mutation editor in the A3D database, we compared the performance of A3D when modeling the impact of mutations on solubility with either experimental structures or the equivalent AF-derived models as input. To this end, we selected a reduced set of proteins unrelated in structure and sequence (see the manuscript Supplementary Information, Table S2 and Figure S5).
The predictions were accurate and coincident for all the proteins, regardless of the kind of structural input used. In addition to changes in solubility, A3D reports the impact of the selected mutations on protein stability, as calculated by FoldX (PMID). Importantly, it has been shown that the quality of AF models is sufficient to predict protein solubility changes upon mutation using FoldX (PMID).

Papers & Methods

Papers describing A3D database

1. A3D Model Organism Database (A3D-MODB): a database for proteome aggregation predictions in model organisms, submitted, 2023.
2. A3D database: structure-based predictions of protein aggregation for the human proteome, Bioinformatics, 38(11):3121-312, 2022
3. A3DyDB: Exploring Structural Aggregation Propensities in the Yeast Proteome, Microbial Cell Factories 22:186, 2023.



The papers describing A3D method and its applications:

1. Aggrescan3D (A3D) 2.0: prediction and engineering of protein solubility, Nucleic Acids Research, gkz321, 2019.
2. Aggrescan3D standalone package for structure-based prediction of protein aggregation properties, Bioinformatics, btz143, 2019.
3. AGGRESCAN3D (A3D): server for prediction of aggregation properties of protein structures, Nucleic Acids Res., 43, W306-W313, 2015.
4. Combining Structural Aggregation Propensity and Stability Predictions To Redesign Protein Solubility, Molecular Pharmaceutics, 15(9), 3846-3859, 2018.

The example guide for A3D 2.0:
5. A3D 2.0 update for the prediction and optimization of protein solubility, Methods in Molecular Biology (biorxiv preprint), 2021.

Download

Download the results of A3D analysis

The results of the A3D analysis are available for download on the subpage dedicated to a particular entry in the A3D database.
Please use the search form, available in the A3D Database Home tab, to find your protein of interest.
The A3D predictions of aggregation and solubility properties can be downloaded in CSV format (raw data) from the Aggrescan3D score tab, or as a graphical profile (image) from the Aggrescan3D plot tab.
The structural data can be downloaded in PDB format from the Structure tab. The images of user-taken snapshots are saved automatically in the Gallery tab and available for download.



Download the annotations

The unified and integrated metadata accompanied by referencing identifiers in the A3D database is available for download in CSV format.

A3D Database: Solubility and Aggregation Properties for the Human Proteome   A3D_Database_HUMAN.csv
A3D Database: Solubility and Aggregation Properties for the Rat Proteome   A3D_Database_RAT.csv
A3D Database: Solubility and Aggregation Properties for the Mouse Proteome   A3D_Database_MOUSE.csv
A3D Database: Solubility and Aggregation Properties for the Zebrafish Proteome   A3D_Database_ZEBRAFISH.csv
A3D Database: Solubility and Aggregation Properties for the Nematode Proteome   A3D_Database_CELEGANS.csv
A3D Database: Solubility and Aggregation Properties for the Fruit Fly Proteome   A3D_Database_DMELANOGASTER.csv
A3D Database: Solubility and Aggregation Properties for the A. thaliana Proteome   A3D_Database_ATHALIANA.csv
A3D Database: Solubility and Aggregation Properties for the Yeast Proteome   A3D_Database_SCEREVISIAE.csv
A3D Database: Solubility and Aggregation Properties for the S. pombe Proteome   A3D_Database_SPOMBE.csv
A3D Database: Solubility and Aggregation Properties for the E. coli Proteome   A3D_Database_ECOLI.csv
A3D Database: Solubility and Aggregation Properties for the M. genitalium Proteome   A3D_Database_MGENITALIUM.csv
A3D Database: Solubility and Aggregation Properties for the SARS-CoV-2 Proteome   A3D_Database_SARS-CoV-2.csv

A3D data for each species can be downloaded below. File names correspond to the job names in the CSV files above.

A3D Database: Solubility and Aggregation Properties for the Human Proteome   HUMAN_SCORES.tar.gz
A3D Database: Solubility and Aggregation Properties for the Rat Proteome   RAT_SCORES.tar.gz
A3D Database: Solubility and Aggregation Properties for the Mouse Proteome   MOUSE_SCORES.tar.gz
A3D Database: Solubility and Aggregation Properties for the Zebrafish Proteome   ZEBRAFISH_SCORES.tar.gz
A3D Database: Solubility and Aggregation Properties for the Nematode Proteome   CELEGANS_SCORES.tar.gz
A3D Database: Solubility and Aggregation Properties for the Fruit Fly Proteome   DMELANOGASTER_SCORES.tar.gz
A3D Database: Solubility and Aggregation Properties for the A. thaliana Proteome   ATHALIANA_SCORES.tar.gz
A3D Database: Solubility and Aggregation Properties for the Yeast Proteome   SCEREVISIAE_SCORES.tar.gz
A3D Database: Solubility and Aggregation Properties for the S. pombe Proteome   SPOMBE_SCORES.tar.gz
A3D Database: Solubility and Aggregation Properties for the E. coli Proteome   ECOLI_SCORES.tar.gz
A3D Database: Solubility and Aggregation Properties for the M. genitalium Proteome   MGENITALIUM_SCORES.tar.gz
A3D Database: Solubility and Aggregation Properties for the SARS-CoV-2 Proteome   Sars-CoV-2_SCORES.tar.gz
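
Because the score archives are keyed by the job names listed in the annotation CSV files, the two downloads can be linked with a short Python sketch such as the one below. The column name `job` and the assumption that each archive member is named after its job (for example `<job>.csv` inside a folder) are hypothetical; check the header of the downloaded CSV and the actual archive layout before relying on them.

```python
import csv
import posixpath
import tarfile

def job_names(csv_file, column="job"):
    """Read job names from an open annotation CSV.
    The column name is an assumption; adjust it to the real header."""
    return [row[column] for row in csv.DictReader(csv_file) if row.get(column)]

def members_for_jobs(archive_fileobj, jobs):
    """Map each job name to the archive members whose file name
    (without folder and extension) equals that job name."""
    wanted = set(jobs)
    found = {}
    with tarfile.open(fileobj=archive_fileobj, mode="r:*") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            stem = posixpath.basename(member.name).split(".", 1)[0]
            if stem in wanted:
                found.setdefault(stem, []).append(member.name)
    return found
```

In practice, the CSV would be opened with `open("A3D_Database_HUMAN.csv", newline="")` and the archive with `open("HUMAN_SCORES.tar.gz", "rb")`. The member names returned can then be passed to `TarFile.extractfile` to read individual score files without unpacking the whole archive.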


Laboratory of Theory of Biopolymers 2018