Structure confidence and disorder
AF predicted structures cover XX% of the yeast proteome and comprise complete protein chains in their monomeric state. Each amino acid in an AF model has a confidence score
(pLDDT) that ranges from 0 to 100, with higher values indicating regions that could be considered equivalent to an experimentally determined structure.
AF predicted structures include regions with low pLDDT scores
(PMID), which generally correspond to highly flexible parts of the structure.
These permanently or transiently disordered segments are often not defined in experimental structures and, therefore, A3D was unaware of them until now.
They are now incorporated in the A3D Database. However, A3D predictions are blind to the dynamic nature of these regions, since AF models correspond to static frames of
their conformational ensemble. Therefore, disordered regions with low AF confidence might artificially
increase the number of S-APRs* in adjacent high-confidence protein domains.
* (Surface-exposed Aggregation Prone Regions)
To deal with this issue, the A3D dataset includes by default two additional precalculated structural aggregation profiles, which correspond to predicted structures restricted to
pLDDT thresholds of >70% and >50% confidence. These thresholds were defined after manually curating 100 randomly selected AF models.
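The thresholding idea can be illustrated with a minimal sketch. It assumes the AF model is a standard fixed-width PDB file in which the pLDDT of each residue is stored in the B-factor column (columns 61-66), as is the case for AF database models. The function name `filter_pdb_by_plddt` is hypothetical, and this is not necessarily how the A3D Database builds its restricted structures.

```python
def filter_pdb_by_plddt(pdb_text: str, threshold: float) -> str:
    """Keep only atoms whose pLDDT (B-factor column) is strictly above threshold.

    Assumption: standard fixed-width PDB records, with pLDDT written in
    columns 61-66. Non-atom records (TER, END, headers) are passed through.
    """
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            plddt = float(line[60:66])  # columns 61-66, 0-based slice 60:66
            if plddt > threshold:
                kept.append(line)
        else:
            kept.append(line)
    return "\n".join(kept)
```

Since all atoms of a residue in an AF model share the same pLDDT, filtering atom records in this way removes or keeps whole residues. Calling the function on the same model text with thresholds of 70 and 50 would give two restricted structures analogous to those described above.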
In the following points, we outline different situations users may face:
• Proteins with high confidence values: the overall AF predicted structure has high pLDDT values. All three computed structures converge on a single aggregation prediction.
Well-defined S-APRs can be delineated from A3D output structures.
(representative examples: Q68D91, S4R460)
• Proteins with limited low confidence regions: proteins that display local surface changes when the >70% or >50% pLDDT thresholds are applied.
A localized increase or decrease in S-APRs can be observed.
(representative examples: P09382, Q9UKL6, Q5EE01)
• Multidomain proteins: proteins with two or more globular domains that correspond to well-defined, high-confidence regions,
while the tethering elements that connect the globular domains possess low pLDDT values. The analysis with restricted pLDDT thresholds often results in disconnected domains and,
as a consequence, some previously sheltered APRs might become exposed. As a result, S-APRs might diverge among the three models, and their boundaries are less evident.
(representative examples: Q96MN5, Q96DA0)
• Proteins with extended low-confidence regions: proteins with large low-confidence regions localized in flexible C- or N-termini, or in long loops,
usually displayed around a well-defined globular core. With the 70% and 50% pLDDT cutoffs, disordered regions are often excluded from the model and, therefore,
not considered in the A3D prediction. As a result, the A3D predictions correspond to well-defined globular domains. Note that, in some cases, free amino acids
unattached to the protein main chain might be observed, essentially because they lie at the exclusion limit between the two pLDDT thresholds.
(representative examples: Q9NRE2, Q8NFW1)
• Proteins with overall low confidence: usually short polypeptides that are mostly disordered, exposing most of their surface to solvent. In some cases,
the entire structure is lost in the predictions corresponding to the >70% or >50% pLDDT thresholds.
(representative examples: Q8N9P0, Q8TEV8, Q9GZY1)
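To get a first impression of which of the situations above a protein may fall into, one could inspect how a pLDDT cutoff fragments the chain. The sketch below is illustrative only: it assumes a per-residue pLDDT list is already available, and the function names and the returned summary keys are hypothetical, not part of A3D.

```python
from typing import Dict, List, Tuple


def retained_segments(plddt: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) runs of residues with pLDDT > threshold."""
    segments = []
    start = None
    for i, score in enumerate(plddt, start=1):
        if score > threshold:
            if start is None:
                start = i  # a new retained run begins here
        elif start is not None:
            segments.append((start, i - 1))  # run ended at the previous residue
            start = None
    if start is not None:
        segments.append((start, len(plddt)))  # run extends to the C-terminus
    return segments


def summarize_cutoff(plddt: List[float], threshold: float) -> Dict[str, object]:
    """Summarize what a pLDDT cutoff leaves of the chain."""
    segs = retained_segments(plddt, threshold)
    kept = sum(end - start + 1 for start, end in segs)
    return {
        "fraction_retained": kept / len(plddt) if plddt else 0.0,
        "n_segments": len(segs),
        # single residues kept on their own, i.e. candidate "free amino acids"
        "isolated_residues": [start for start, end in segs if start == end],
    }
```

Under these assumptions, a retained fraction close to 1 suggests a high-confidence protein, several long segments suggest a multidomain protein with excluded linkers, and a retained fraction close to 0 suggests an overall low-confidence protein. Any numeric boundaries between these cases would be a user choice rather than something defined by the A3D Database.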
Quaternary structure context
In the AF database, structure predictions are restricted to single chains. This precludes the analysis of the quaternary structure context in the A3D Database.
The overlap between the physicochemical properties governing protein-protein native interactions and non-native contacts triggering aberrant self-assembly implies
that protein interfaces are enriched in S-APRs. In proteins displaying quaternary structure, this results in an over-prediction of S-APRs in protein monomers,
relative to the same subunits in their native oligomeric state.
Gathering important information
We strongly encourage users to compile as much information as possible about their protein of interest to make the most of A3D predictions.
Some helpful questions to shape the output would be:
• (i) Does the protein possess disordered regions?
• (ii) Do these disordered segments correlate with low pLDDT?
• (iii) Does the protein bind another protein? Does it form a homo- or hetero-oligomer?
• (iv) Does the protein contain a signal peptide?
• (v) Is the protein in its mature form or, on the contrary, is it a proprotein?
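Question (ii) can be checked with a few lines of code once disorder annotations are at hand. The following sketch assumes the user supplies a per-residue pLDDT list and a list of annotated disordered ranges (1-based, inclusive) taken from any external source. The function name and the default cutoff of 50 are illustrative assumptions.

```python
from typing import List, Optional, Tuple


def disorder_low_plddt_overlap(
    plddt: List[float],
    disordered_ranges: List[Tuple[int, int]],
    cutoff: float = 50.0,
) -> Optional[float]:
    """Fraction of annotated disordered residues whose pLDDT is <= cutoff.

    Returns None when no annotated residue falls within the chain.
    """
    disordered = set()
    for start, end in disordered_ranges:
        disordered.update(range(start, end + 1))
    # ignore annotations that fall outside the modelled chain
    disordered = {i for i in disordered if 1 <= i <= len(plddt)}
    if not disordered:
        return None
    low = sum(1 for i in disordered if plddt[i - 1] <= cutoff)
    return low / len(disordered)
```

A value near 1 would indicate that the annotated disorder and the low-confidence regions largely coincide, in which case the restricted pLDDT structures are likely to exclude those segments.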
In addition to the advice above, we also recommend reviewing the dedicated literature on how to use the A3D algorithm and how to evaluate
its aggregation predictions:
1. AGGRESCAN3D: Toward the Prediction of the Aggregation Propensities of Protein Structures, Methods Mol Biol. 2018, 1762:427-443.
2. Aggrescan3D (A3D) 2.0: prediction and engineering of protein solubility, Nucleic Acids Res. 2019, 47(W1):W300-W307.
3. AGGRESCAN3D (A3D): server for prediction of aggregation properties of protein structures, Nucleic Acids Res. 2015, 43(W1):W306-13.
4. Computational prediction of protein aggregation: Advances in proteomics, conformation-specific algorithms and biotechnological applications, Comput Struct Biotechnol J. 2020, 18:1403-1413.