Aggrescan3D: Project manager

Aggrescan3D (A3D) is a structure-based algorithm that uses three-dimensional protein coordinates to project each amino acid's intrinsic aggregation propensity value onto the structure. The aggregation score of each residue is then corrected as a function of its specific solvent exposure and the aggregation propensity of neighboring residues within a sphere of 10 Å radius. Given this dependence on the spatial position of each atom of the protein, both the atomic resolution and the biological relevance of the input structures affect the accuracy of A3D predictions.
Therefore, in order to correctly interpret the aggregation tendencies reported in the A3D database, users should consider two critical characteristics of the AlphaFold (AF) predictions:
• Structure confidence and disorder
• Quaternary structure context
(see details in the next tab: Source of the structural data: AlphaFold predictions)
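The neighborhood part of the description above can be illustrated with a minimal sketch. It only shows how residues inside a 10 Å sphere around a given residue could be selected; it is not the A3D scoring function. The function name, the use of one representative atom per residue, and the coordinates are all assumptions made for illustration.

```python
import math


def neighbors_within(coords, center_index, radius=10.0):
    """Return indices of residues whose representative atom lies within
    `radius` angstroms of the residue at `center_index`."""
    center = coords[center_index]
    return [
        i
        for i, point in enumerate(coords)
        if i != center_index and math.dist(center, point) <= radius
    ]


# Hypothetical coordinates (in angstroms), one representative atom per residue.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (15.0, 0.0, 0.0)]
print(neighbors_within(coords, 0))
```

Because the neighborhood is defined purely by distance, any change in the modeled position of a flexible segment changes which residues contribute to a given score, which is why structure confidence matters for interpretation.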

Structure confidence and disorder

AF predicted structures cover 98.5% of the human proteome and comprise complete protein chains in their monomeric state. Each amino acid in an AF model has a confidence score (pLDDT) that ranges from 0 to 100, indicating which regions could be considered equivalent to an experimentally determined structure. It is estimated that around 30% of the amino acids of the human proteome are located in non-globular regions, such as intrinsically disordered regions. Therefore, it is not surprising that AF predicted structures regularly include regions with low pLDDT scores (PMID). These permanently or transiently disordered segments are often not defined in experimental structures and, therefore, A3D was unaware of them until now. They are now incorporated in the A3D Database. A3D predictions are blind to the dynamic nature of these regions, since AF models correspond to static frames of their conformational ensemble. Therefore, the presence of disordered, low-confidence AF regions might artificially increase the presence of S-APRs* in adjacent high-confidence protein domains.
* (Surface-exposed Aggregation Prone Regions)

To deal with this issue, the A3D dataset includes, by default, two additional precalculated structural aggregation profiles. They correspond to predicted structures filtered at pLDDT thresholds of >70% and >50% confidence, which were defined after manually curating 100 randomly selected AF models.
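The effect of the two thresholds can be sketched as follows. This is a minimal illustration, assuming per-residue confidence is available as rows of [residue number, residue name, pLDDT]; the function name and the values are hypothetical, and the actual filtering procedure used to build the database may differ.

```python
def filter_by_plddt(scores, cutoff):
    """Keep residues whose pLDDT is strictly above `cutoff`."""
    return [row for row in scores if float(row[2]) > cutoff]


# Made-up rows in the assumed [residue number, residue name, pLDDT] layout.
scores = [
    [1, 'MET', '42.10'],
    [2, 'GLU', '68.55'],
    [3, 'VAL', '91.30'],
    [4, 'LEU', '73.02'],
]

for cutoff in (50.0, 70.0):
    kept = [row[0] for row in filter_by_plddt(scores, cutoff)]
    print(cutoff, kept)
```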

In the following points, we outline different situations users may face:
• Proteins with high confidence values: the overall AF predicted structure possesses high pLDDTs. All three computed structures converge into a unique aggregation prediction. Well-defined S-APRs can be delineated from A3D output structures. (representative examples: Q68D91, S4R460)
• Proteins with limited low confidence regions: proteins that display local surface modifications with confidence thresholds of >70% or >50% pLDDT. An increase or decrease of localized S-APRs can be observed.
(representative examples: P09382, Q9UKL6, Q5EE01)
• Multidomain proteins: proteins with two or more globular domains that correspond to well-defined and confident regions, while the tethering elements that connect the globular domains possess low pLDDT values. The analysis with restricted pLDDT thresholds often results in disconnected domains and, as a consequence, some previously sheltered APRs might become exposed. As a result, S-APRs might diverge among the three models, and their boundaries are less evident.
(representative examples: Q96MN5, Q96DA0)
• Proteins with extended low-confidence regions: proteins with large low-confidence regions localized in flexible N- or C-termini or in long loops, usually displayed around a well-defined globular core. With the 70% and 50% pLDDT cutoffs, disordered regions are often excluded from the model and, therefore, not considered in the A3D prediction. As a result, the A3D predictions correspond to well-defined globular domains. Note that, in some cases, free amino acids unattached to the protein main chain may be observed, essentially because they lie at the exclusion limit between two pLDDT thresholds.
(representative examples: Q9NRE2, Q8NFW1)
• Proteins with overall low confidence: usually short polypeptides that are mostly disordered and expose most of their surface to solvent. In some cases, the entire structure is lost in the predictions corresponding to the >70% or >50% pLDDT thresholds.
(representative examples: Q8N9P0, Q8TEV8, Q9GZY1)
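To see which of the situations above applies to a filtered model, it can help to check how the retained residues group into continuous stretches. The sketch below is one way to do this, assuming the residue numbers that pass a pLDDT cutoff are already known; the function name and the example numbers are hypothetical. Several long segments suggest disconnected domains, while single-residue segments correspond to the free amino acids mentioned above.

```python
def contiguous_segments(residue_numbers):
    """Group residue numbers into (start, end) runs of consecutive numbers."""
    segments = []
    for number in sorted(residue_numbers):
        if segments and number == segments[-1][1] + 1:
            # Extend the current run by one residue.
            segments[-1] = (segments[-1][0], number)
        else:
            # Start a new run.
            segments.append((number, number))
    return segments


# Hypothetical residue numbers that survived a pLDDT cutoff.
kept = [5, 6, 7, 8, 21, 40, 41, 42]
segments = contiguous_segments(kept)
isolated = [start for start, end in segments if start == end]
print(segments)
print(isolated)
```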

Quaternary structure context

In the AF database, structure predictions are restricted to single chains. This precludes the analysis of the quaternary structure context in the A3D Database. The overlap between the physicochemical properties governing native protein-protein interactions and the non-native contacts triggering aberrant self-assembly implies that protein interfaces are enriched in S-APRs. In proteins displaying quaternary structure, this results in an over-prediction of S-APRs in the monomers, relative to the same subunits in their native oligomeric state.

Gathering important information

We strongly encourage users to compile as much information as possible about their protein of interest in order to exploit A3D predictions.
Some helpful questions to shape the output would be:
• (i) Does the protein possess disordered regions?
• (ii) Do these disordered segments correlate with low pLDDT?
• (iii) Does the protein bind another protein? Does it form a homo- or hetero-oligomer?
• (iv) Does the protein contain a signal peptide?
• (v) Is the protein in its mature form, or is it a proprotein?

In addition to the advice above, we recommend reviewing the dedicated literature on how to use the A3D algorithm and how to evaluate its aggregation predictions:
1. AGGRESCAN3D: Toward the Prediction of the Aggregation Propensities of Protein Structures, Methods Mol Biol. 2018, 1762:427-443.
2. Aggrescan3D (A3D) 2.0: prediction and engineering of protein solubility, Nucleic Acids Res. 2019, 47(W1):W300-W307.
3. AGGRESCAN3D (A3D): server for prediction of aggregation properties of protein structures, Nucleic Acids Res. 2015, 43(W1):W306-13.
4. Computational prediction of protein aggregation: Advances in proteomics, conformation-specific algorithms and biotechnological applications, Comput Struct Biotechnol J. 2020, 18:1403-1413.

The AlphaFold Protein Structure Database collects 23391 predictions assigned to the Homo sapiens proteome, corresponding to 20504 unique UniProt entries. This set covers well the subset of human proteins with reviewed status, which are longer than 16 amino acids. The extremely long sequences (longer than 2700 residues and up to 34350) were split into overlapping fragments and modeled independently. As a result, the AF-DB provides multiple structure predictions, a few to several dozen, for a particular UniProt identifier.
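The idea of splitting a long sequence into overlapping fragments can be illustrated with a short sketch. The window size and step used here are hypothetical parameters chosen for illustration, not values stated in this document, and the function name is an assumption.

```python
def overlapping_fragments(seq_len, size, step):
    """Return 1-based inclusive (start, end) windows covering a sequence."""
    if seq_len <= size:
        return [(1, seq_len)]
    starts = list(range(1, seq_len - size + 2, step))
    # Add a final window if the regular stride leaves the C-terminus uncovered.
    if starts[-1] + size - 1 < seq_len:
        starts.append(seq_len - size + 1)
    return [(start, start + size - 1) for start in starts]


# Hypothetical example: an 11-residue sequence, windows of 4, step of 3.
print(overlapping_fragments(11, size=4, step=3))
```

Each window overlaps its neighbor, so a domain that straddles a fragment boundary is still likely to appear intact in at least one fragment.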
The surface of membrane proteins exhibits distinct physicochemical properties that considerably influence A3D predictions. Consequently, we have identified a comprehensive set of transmembrane and intramembrane proteins, 5934 in total. For these membrane proteins, the database provides not only A3D-derived aggregation predictions but also TOPCONS-generated membrane topology predictions. In addition, 4360 entries have been annotated with signal peptide information.

For all the proteins, we collected rich annotations. They include the UniProt identifier, gene, common short name, and long descriptive name. Any of them can be used to search both the AF and A3D databases. For the user's convenience, we also added information about the length of the original sequence. In addition, for sequences split into fragments (209), we provide the region corresponding to the predicted structure, which makes it easier to identify a domain of interest.
Because the AF-predicted structures vary considerably in confidence level, we also report for each entry the number of residues below the pLDDT thresholds of 70 and 50, which can help to screen out poorly predicted models before a user-defined A3D analysis.
The unified and integrated metadata accompanied by referencing identifiers in the A3D database is available for download in CSV format.
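The per-entry counts of low-confidence residues can be reproduced from per-residue pLDDT values with a few lines. This is a minimal sketch; the function name and the example values are hypothetical, and whether the database counts use a strict or inclusive comparison is an assumption.

```python
def count_below(plddt_values, thresholds=(70.0, 50.0)):
    """Count residues with pLDDT strictly below each threshold."""
    return {t: sum(1 for value in plddt_values if value < t) for t in thresholds}


# Made-up per-residue pLDDT values for a six-residue model.
values = [89.3, 94.18, 65.0, 48.2, 30.1, 71.5]
counts = count_below(values)
print(counts)
```

Dividing each count by the sequence length gives a simple fraction that can be used to rank entries by overall model confidence.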

A3D has been shown to be accurate in predicting the changes in the solubility of globular proteins upon mutation (PMID). Using the mutation editor in the A3D database, we compared the performance of A3D in modeling the impact of mutations on solubility when starting from either the experimental structures or the equivalent AF-derived models. To this aim, we selected a reduced set of structurally and sequentially unrelated proteins (see the manuscript Supplementary Information, Table S2 and Figure S5).
The predictions turned out to be accurate and coincident for all the proteins, independently of the kind of structural input we used. In addition to changes in solubility, A3D provides the impact of the selected mutations on protein stability, as calculated by FoldX (PMID). Importantly, it has been shown that the quality of AF-models is sufficient to predict protein solubility changes upon mutation using FoldX (PMID).

Each hproteome job's results can be downloaded via the REST API. A job is identified by a unique identifier called a job ID. The sample Python script provided below downloads the complete results for a specific job. For other uses of the API, please see the main page tutorial. The job ID can be found in the address bar of the web browser when viewing the job. For example, http://biocomp.chem.uw.edu.pl/A3D2/hproteome_job/65417ebb9b66183/ means that the job ID is 65417ebb9b66183. The A3D_Database.csv file, which contains the IDs of all the non-custom jobs, can be downloaded in the Download tab.

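#!/usr/bin/env python
import requests

# Replace with the ID of the job to download (example ID taken from the text above).
job_id = '65417ebb9b66183'
url = 'http://biocomp.chem.uw.edu.pl/A3D2/RESTful/hproteome_job/%s/' % job_id

req = requests.get(url)

# A 404 status code means that the job ID does not exist in the database.
print(req.status_code)
if req.status_code == 200:
    data = req.json()
    # Print every top-level field of the job record.
    for k in data.keys():
        print("key: %s: %s" % (k, data[k]))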

The script should return data in the following format. A 404 response indicates that the job ID does not exist in the database.

key: A3Dscore: {'avg': -0.5321, 'max': 1.8543, 'min': -2.6878, 'sum': -51.0901, 'tab': [['1', 'E', 'A', '-1.3183'], ['2', 'V', 'A', '0.4278'], ... ]}
key: af_data: {'conf_cutoff': '50.00', 
'full_protein': True, 
'gene': 'IGHV3OR16-9', 
'long_name': 'Immunoglobulin heavy variable 3/OR16-9 (non-functional)', 
'n_res': 96, 
'organism': 'Homo sapiens', 
'region': '1-97', 
'scores': [[1, 'GLU', '89.30'], [2, 'VAL', '94.18'], ... ], 
'segment': 1, 
'seq_len': 97, 
'uniprot_id': 'S4R460'
}
key: aggregation_distance: 10
key: automated_mutations: None
key: chain_sequence: EVQLVESGGGLVQPGGSLRLSCAASGFTFSNHYTSWVRQAPGKGLEWVSYSSGNSGYTNYADSVKGRFTISRDNAKNSLYLQMNSLRAEDTAVYYC
key: dynamic_mode: 0
key: mutated_residues: None
key: mutation_energetic_effect: 0.0
key: mutation_mode: 0
key: project_name: AF-S4R460-F1_c50
key: stability_calculations: 1
key: started: Mon, 23 Aug 2021 13:05:35 GMT
key: status: done
key: updated: Mon, 23 Aug 2021 13:06:30 GMT
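Once the response has been parsed, the per-residue table can be filtered directly. The sketch below assumes, based on the sample output above, that each row of A3Dscore['tab'] holds the residue number, one-letter residue code, chain, and score as strings; it also assumes that positive scores mark aggregation-prone residues. The function name and the cutoff parameter are hypothetical.

```python
def aggregation_prone(a3d_score, cutoff=0.0):
    """Return (residue number, residue, chain, score) for rows scoring above cutoff."""
    hits = []
    for number, residue, chain, score in a3d_score['tab']:
        value = float(score)  # scores arrive as strings in the sample output
        if value > cutoff:
            hits.append((number, residue, chain, value))
    return hits


# The two rows shown in the sample output above.
sample = {'tab': [['1', 'E', 'A', '-1.3183'], ['2', 'V', 'A', '0.4278']]}
print(aggregation_prone(sample))
```

With a real response, the same function would be called as aggregation_prone(data['A3Dscore']).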

Download

Download the results of A3D analysis

Download the annotations

The unified and integrated metadata accompanied by referencing identifiers in the A3D database is available for download in CSV format.

A3D-HSAPIENS.csv

Download the data

HUMAN_SCORES.tar.gz
