A3D Database | Solubility and Aggregation Properties
The A3D database provides the analysis of solubility and aggregation properties for human protein structures from the AlphaFold Database.

About

Protein aggregation is associated with highly debilitating human disorders and constitutes a bottleneck in the production of therapeutic proteins. Knowledge of the human protein structure repertoire has increased dramatically with the recent development of the AlphaFold (AF) deep-learning method and its database. Applying the A3D analysis to high-quality AF structural predictions can improve the understanding of protein aggregation properties and opens the possibility of rationally designing protein solubility.

Aggrescan3D (A3D) is a structure-based algorithm that uses three-dimensional protein coordinates to project each amino acid's intrinsic aggregation propensity value onto the structure. The aggregation score is then corrected as a function of the residue's specific solvent exposure and the aggregation propensity of neighboring residues within a sphere of 10 Å radius. Given this dependence on the spatial position of each atom of the protein, both the atomic resolution and the biological relevance of the input structures affect the accuracy of A3D predictions.
Therefore, in order to correctly interpret the aggregation tendencies reported in the A3D database, users should consider two critical characteristics of the AlphaFold predictions:
• Structure confidence and disorder
• Quaternary structure context
(see details in the next tab: Source of the structural data: AlphaFold predictions)
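
The neighborhood criterion mentioned above (residues within a sphere of 10 Å radius) can be illustrated with a minimal sketch. This is not the A3D scoring function: it only shows how neighbors inside the sphere could be collected. It assumes one representative point per residue; the function name and the toy coordinates are hypothetical.

```python
import math

AGGREGATION_DISTANCE = 10.0  # sphere radius in Å, as stated in the text


def neighbors_within(coords, center_index, radius=AGGREGATION_DISTANCE):
    """Return indices of residues whose representative point lies within
    `radius` Å of residue `center_index` (the residue itself is excluded)."""
    center = coords[center_index]
    return [
        i for i, point in enumerate(coords)
        if i != center_index and math.dist(center, point) <= radius
    ]


# Toy coordinates in Å (hypothetical, not taken from any real structure).
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (11.4, 0.0, 0.0),
    (0.0, 9.0, 0.0),
]

print(neighbors_within(toy_coords, 0))  # residue 3 lies 11.4 Å away and is excluded
```

Because the neighbor set depends directly on the coordinates, a misplaced low-confidence segment changes which residues fall inside the sphere, which is why the confidence of the input model matters.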

Structure confidence and disorder

AF predicted structures cover 98.5% of the human proteome and model complete protein chains in their monomeric state. Each amino acid in an AF model has a confidence score (pLDDT) that ranges from 0 to 100, indicating which regions can be considered equivalent to an experimentally determined structure. It is estimated that around 30% of the amino acids of the human proteome are located in non-globular regions, such as intrinsically disordered regions. It is therefore not surprising that AF predicted structures regularly include regions with low pLDDT scores (PMID). These permanently or transiently disordered segments are often not defined in experimental structures, so until now A3D could not take them into account. They are now incorporated in the A3D Database. However, A3D predictions are blind to the dynamic nature of these regions, since AF models are static frames of their conformational ensemble. As a consequence, disordered regions with low AF confidence might artificially increase the number of S-APRs* in adjacent high-confidence protein domains.
* (Surface-exposed Aggregation Prone Regions)

To deal with this issue, the A3D dataset includes by default two additional precalculated structural aggregation profiles, which correspond to the predicted structures restricted to pLDDT confidence thresholds of >70% and >50%. These thresholds were defined after manual curation of 100 randomly selected AF models.
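
As a minimal sketch of what such a cutoff does, the snippet below keeps only the residues whose pLDDT exceeds a threshold. It assumes per-residue scores in the same layout as the 'scores' field of the REST API output shown further down this page ([residue number, three-letter code, pLDDT as a string]); the function name and the example values are hypothetical, and the actual A3D pipeline may differ.

```python
def filter_by_plddt(scores, cutoff):
    """Keep only residues whose pLDDT is strictly above `cutoff`.

    Each entry is [residue_number, three_letter_code, pLDDT_as_string].
    """
    return [entry for entry in scores if float(entry[2]) > cutoff]


# Hypothetical per-residue scores, for illustration only.
example_scores = [
    [1, 'MET', '35.20'],
    [2, 'GLU', '48.75'],
    [3, 'VAL', '62.10'],
    [4, 'GLN', '89.30'],
    [5, 'LEU', '94.18'],
]

# Cutoff 0 keeps the full model; 50 and 70 mimic the two restricted models.
for cutoff in (0, 50, 70):
    kept = [entry[0] for entry in filter_by_plddt(example_scores, cutoff)]
    print("pLDDT > %d: residues %s" % (cutoff, kept))
```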

In the following points, we outline different situations users may face:
• Proteins with high confidence values: the overall AF predicted structure possesses high pLDDTs. All three computed structures converge into a unique aggregation prediction. Well-defined S-APRs can be delineated from A3D output structures. (representative examples: Q68D91, S4R460)
• Proteins with limited low confidence regions: proteins that display local surface modifications with confidence thresholds of >70% or >50% pLDDT. An increase or decrease of localized S-APRs can be observed.
(representative examples: P09382, Q9UKL6, Q5EE01)
• Multidomain proteins: proteins with two or more globular domains that correspond to well-defined and confident regions, while the tethering elements that connect the globular domains possess low pLDDT values. The analysis with restricted pLDDT thresholds often results in disconnected domains and, as a consequence, some previously sheltered APRs might become exposed. As a result, S-APRs might diverge in the three models, and their boundaries are less evident.
(representative examples: Q96MN5, Q96DA0)
• Proteins with extended low-confidence regions: proteins with large low-confidence regions localized in flexible C- or N-termini or in long loops, usually displayed around a well-defined globular core. With the 70% and 50% pLDDT cutoffs, disordered regions are often excluded from the model and, therefore, not considered in the A3D prediction. As a result, the A3D predictions correspond to well-defined globular domains. Note that free amino acids, unattached to the protein main chain, might be observed in some cases, essentially because they lie at the exclusion limit between two pLDDT thresholds.
(representative examples: Q9NRE2, Q8NFW1)
• Proteins with overall low confidence: usually short polypeptides that are mostly disordered, exposing most of their surface to solvent. In some cases, the entire structure is lost in the predictions corresponding to the >70% or >50% pLDDT thresholds.
(representative examples: Q8N9P0, Q8TEV8, Q9GZY1)
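
The "free amino acids" mentioned in the list above can be spotted with a short check on the residue numbers that survive a pLDDT cutoff. The sketch below flags residues that have lost both sequence neighbors; the function name and the example numbering are hypothetical.

```python
def isolated_residues(kept_numbers):
    """Return residue numbers for which neither sequence neighbor
    (i - 1 or i + 1) was kept after the pLDDT cutoff."""
    kept = set(kept_numbers)
    return sorted(
        n for n in kept
        if (n - 1) not in kept and (n + 1) not in kept
    )


# Hypothetical residue numbers retained after applying a cutoff.
kept_after_cutoff = [5, 6, 7, 8, 12, 20, 21, 22, 30]

print(isolated_residues(kept_after_cutoff))  # residues 12 and 30 have no kept neighbor
```

Such residues are detached from the main chain in the restricted model, so their scores are best interpreted with caution.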

Quaternary structure context

In the AF database, structure predictions are restricted to single chains, which precludes the analysis of the quaternary structure context in the A3D Database. The physicochemical properties governing native protein-protein interactions overlap with those of the non-native contacts that trigger aberrant self-assembly, which implies that protein interfaces are enriched in S-APRs. For proteins with quaternary structure, this results in an over-prediction of S-APRs in the monomers relative to the same subunits in their native oligomeric state.

Gathering important information

We strongly encourage users to compile as much information as possible about their protein of interest in order to make the most of A3D predictions.
Some helpful questions to shape the output would be:
• (i) Does the protein possess disordered regions?
• (ii) Do these disordered segments correlate with low pLDDT?
• (iii) Does the protein bind another protein? Does it form a homo- or hetero-oligomer?
• (iv) Does the protein contain a signal peptide?
• (v) Is the protein in its mature form, or is it a proprotein?

In addition to the advice above, we also recommend reviewing the dedicated literature on how to use the A3D algorithm and how to evaluate its aggregation predictions:
1. AGGRESCAN3D: Toward the Prediction of the Aggregation Propensities of Protein Structures, Methods Mol Biol. 2018, 1762:427-443.
2. Aggrescan3D (A3D) 2.0: prediction and engineering of protein solubility, Nucleic Acids Res. 2019, 47(W1):W300-W307.
3. AGGRESCAN3D (A3D): server for prediction of aggregation properties of protein structures, Nucleic Acids Res. 2015, 43(W1):W306-13.
4. Computational prediction of protein aggregation: Advances in proteomics, conformation-specific algorithms and biotechnological applications, Comput Struct Biotechnol J. 2020, 18:1403-1413.

The AlphaFold Protein Structure Database contains 23391 predictions assigned to the Homo sapiens proteome, corresponding to 20504 unique UniProt entries. This set provides good coverage of the reviewed human proteins longer than 16 amino acids. Extremely long sequences (longer than 2700 residues and up to 34350) were split into overlapping fragments and modeled independently. As a result, the AF-DB provides multiple structure predictions, from a few to several dozen, for some UniProt identifiers.
Membrane proteins have specific physicochemical properties on their surface, which can significantly bias A3D predictions. We therefore excluded transmembrane and intramembrane proteins (5156 in total); these can still be analyzed with a custom configuration of A3D.

For all proteins, we collected rich annotations: UniProt identifier, gene, common short name, and long descriptive name. Any of them can be used to search both the AF and A3D databases. For the user's convenience, we also added the length of the original sequence. In addition, for sequences split into fragments, we provide the region corresponding to the predicted structure, which makes it much easier to identify a domain of interest.
Because the AF-predicted structures vary considerably in confidence level, we also report for each entry the number of residues with pLDDT below 70 and below 50, which can help to screen out poorly predicted structures in user-defined A3D analyses.
The unified and integrated metadata accompanied by referencing identifiers in the A3D database is available for download in CSV format.
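
The per-entry counts of residues below the 70 and 50 thresholds can be reproduced from a list of per-residue pLDDT values. The sketch below is illustrative only: the function name and the example values are hypothetical, and the fraction is an extra convenience that is not claimed to be part of the metadata.

```python
def low_confidence_summary(plddt_values, thresholds=(70, 50)):
    """For each threshold, return (count, fraction) of residues whose
    pLDDT is below that threshold."""
    total = len(plddt_values)
    summary = {}
    for threshold in thresholds:
        count = sum(1 for value in plddt_values if value < threshold)
        fraction = count / total if total else 0.0
        summary[threshold] = (count, fraction)
    return summary


# Hypothetical per-residue pLDDT values for a five-residue model.
example_plddt = [35.20, 48.75, 62.10, 89.30, 94.18]

for threshold, (count, fraction) in low_confidence_summary(example_plddt).items():
    print("pLDDT < %d: %d residues (%.0f%% of the model)"
          % (threshold, count, 100 * fraction))
```

A user-defined limit on these fractions is one possible way to decide which entries to leave out of a bulk analysis.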

A3D has been shown to be accurate in predicting the changes in the solubility of globular proteins upon mutation (PMID). Using the mutation editor in the A3D database, we compared the performance of A3D in modeling the impact of mutations on solubility when starting from either the experimental structures or the equivalent AF-derived models. To this end, we selected a reduced set of structurally and sequentially unrelated proteins (see the manuscript Supplementary Information, Table S2 and Figure S5).
The predictions turned out to be accurate and consistent for all the proteins, regardless of the kind of structural input used. In addition to changes in solubility, A3D provides the impact of the selected mutations on protein stability, as calculated by FoldX (PMID). Importantly, it has been shown that the quality of AF models is sufficient to predict protein solubility changes upon mutation using FoldX (PMID).

The results of each hproteome job can be downloaded via the REST API. A job is identified by a unique identifier called a job ID. The sample Python script below downloads the complete results for a specific job. For other uses of the API, please see the main page tutorial. The job ID can be found in the address bar of the web browser when viewing the job. For example, http://biocomp.chem.uw.edu.pl/A3D2/hproteome_job/65417ebb9b66183/ means that the job ID is 65417ebb9b66183. The A3D_Database.csv file, which contains the IDs of all non-custom jobs, can be downloaded in the Download tab.
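#!/usr/bin/env python
import requests

# Replace with the job ID taken from the job's URL (here: the example ID from the text above).
job_id = '65417ebb9b66183'
url = 'http://biocomp.chem.uw.edu.pl/A3D2/RESTful/hproteome_job/%s/' % job_id

req = requests.get(url)

# 200 means the job was found; 404 means there is no job with this ID.
print(req.status_code)
if req.status_code == 200:
    data = req.json()
    for k in data.keys():
        print("key: %s: %s" % (k, data[k]))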
The script should return data in the following format. A 404 response indicates a job ID that does not exist in the database.
key: A3Dscore: {'avg': -0.5321, 'max': 1.8543, 'min': -2.6878, 'sum': -51.0901, 'tab': [['1', 'E', 'A', '-1.3183'], ['2', 'V', 'A', '0.4278'], ... ]}
key: af_data: {'conf_cutoff': '50.00', 
'full_protein': True, 
'gene': 'IGHV3OR16-9', 
'long_name': 'Immunoglobulin heavy variable 3/OR16-9 (non-functional)', 
'n_res': 96, 
'organism': 'Homo sapiens', 
'region': '1-97', 
'scores': [[1, 'GLU', '89.30'], [2, 'VAL', '94.18'], ... ], 
'segment': 1, 
'seq_len': 97, 
'uniprot_id': 'S4R460'
}
key: aggregation_distance: 10
key: automated_mutations: None
key: chain_sequence: EVQLVESGGGLVQPGGSLRLSCAASGFTFSNHYTSWVRQAPGKGLEWVSYSSGNSGYTNYADSVKGRFTISRDNAKNSLYLQMNSLRAEDTAVYYC
key: dynamic_mode: 0
key: mutated_residues: None
key: mutation_energetic_effect: 0.0
key: mutation_mode: 0
key: project_name: AF-S4R460-F1_c50
key: stability_calculations: 1
key: started: Mon, 23 Aug 2021 13:05:35 GMT
key: status: done
key: updated: Mon, 23 Aug 2021 13:06:30 GMT
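
Once the JSON has been retrieved, the per-residue values in the 'tab' list of A3Dscore can be post-processed. The sketch below lists residues with a positive score. It assumes that each row is [residue number, one-letter residue code, chain, score] and follows the usual A3D convention that positive scores mark aggregation-prone residues; the function name is hypothetical, and the last two rows of the example are made-up values added for illustration.

```python
def aggregation_prone(a3d_score, cutoff=0.0):
    """Return (chain, residue_number, residue, score) for every row of
    a3d_score['tab'] whose score is above `cutoff`."""
    prone = []
    for number, residue, chain, score in a3d_score['tab']:
        value = float(score)  # scores arrive as strings in the JSON
        if value > cutoff:
            prone.append((chain, number, residue, value))
    return prone


# First two rows copied from the sample output above; the last two are hypothetical.
example_a3d_score = {
    'tab': [
        ['1', 'E', 'A', '-1.3183'],
        ['2', 'V', 'A', '0.4278'],
        ['3', 'Q', 'A', '-1.0500'],
        ['4', 'L', 'A', '1.2000'],
    ]
}

for chain, number, residue, value in aggregation_prone(example_a3d_score):
    print("chain %s residue %s%s: %.4f" % (chain, residue, number, value))
```

Residue numbers are kept as strings, exactly as they appear in the output, so no assumption is made about their format.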

Papers & Methods

The paper describing A3D database:

1. A3D Database: Structure-based Protein Aggregation Predictions for the Human Proteome, submitted, 2021.



The papers describing A3D method and its applications:

1. Aggrescan3D (A3D) 2.0: prediction and engineering of protein solubility, Nucleic Acids Research, gkz321, 2019.
2. Aggrescan3D standalone package for structure-based prediction of protein aggregation properties, Bioinformatics, btz143, 2019.
3. AGGRESCAN3D (A3D): server for prediction of aggregation properties of protein structures, Nucleic Acids Res., 43, W306-W313, 2015.
4. Combining Structural Aggregation Propensity and Stability Predictions To Redesign Protein Solubility, Molecular Pharmaceutics, 15(9), 3846-3859, 2018.

The example guide for A3D 2.0:
5. A3D 2.0 update for the prediction and optimization of protein solubility, Methods in Molecular Biology (biorxiv preprint), 2021.

Download

Download the results of A3D analysis

The results of the A3D analysis are available for download on the subpage dedicated to a particular entry in the A3D database.
Please use the search form, available at the A3D Database Home tab, to find the protein of your interest.
The A3D predictions of aggregation and solubility properties can be downloaded in CSV format (raw data) from the Aggrescan3D score tab, or as a graphical profile (image) from the Aggrescan3D plot tab.
The structural data can be downloaded in PDB format from the Structure tab. The images of user-taken snapshots are saved automatically in the Gallery tab and available for download.



Download the annotations

The unified and integrated metadata accompanied by referencing identifiers in the A3D database is available for download in CSV format.

A3D Database: Solubility and Aggregation Properties for the Human Proteome   A3D_Database.csv


Laboratory of Theory of Biopolymers 2018