Frequently Asked Questions
- Questions about ProteomeHD (unsupervised machine learning)
- What is ProteomeHD?
- ProteomeHD is a database of protein abundance changes in response to biological perturbations.
It is a data matrix for functional proteomics: proteins that are up- or downregulated to a similar extent under the same biological conditions probably have related cellular functions.
ProteomeHD differs from other drafts of the human proteome in that it does not catalogue the proteome of specific tissues or subcellular compartments. Instead, ProteomeHD catalogues the transitions between different proteome states, i.e. changes in protein abundance or localization resulting from cellular perturbations. HD, or high-definition, refers to two aspects of the dataset. First, quantitation accuracy: all experiments are quantified using SILAC (stable isotope labelling by amino acids in cell culture). Second, HD refers to the number of observations ("pixels") available for each protein. As more perturbations are analysed, regulatory patterns become more refined and can be compared more accurately.
- How many proteins and conditions does ProteomeHD cover?
- ProteomeHD v1.0 contains 10,323 human proteins and 294 biological conditions. Not every protein has been detected in every experiment. On average, there are 112 SILAC measurements for each protein.
- How do you measure protein co-regulation in ProteomeHD?
- We use the unsupervised machine-learning algorithm treeClust from Buttrey and Whitaker. For all possible pairs of proteins, it determines how similar their abundance changes are across ProteomeHD. We find treeClust to be a strong improvement over the more commonly used Pearson correlation, both in sensitivity and robustness.
TreeClust uses decision trees that can handle missing values, which are common in shotgun proteomics data. It outputs a dissimilarity metric that reflects how often two proteins land in the same decision tree leaves. For any two proteins this allows us to determine a “co-regulation score”, defined as (1 - treeClust dissimilarity).
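As a simplified illustration of the idea above (not the treeClust implementation itself), the sketch below assumes we already know which leaf each protein lands in for every tree. The function names and the leaf assignments are hypothetical, and the dissimilarity is taken here to be the fraction of trees in which the two proteins land in different leaves.

```python
def leaf_dissimilarity(leaves_a, leaves_b):
    """Fraction of trees in which two proteins land in different leaves.

    leaves_a and leaves_b hold one leaf identifier per tree, in the same
    tree order. This is an assumed, simplified reading of the metric.
    """
    if len(leaves_a) != len(leaves_b) or not leaves_a:
        raise ValueError("need one leaf per tree for both proteins")
    different = sum(a != b for a, b in zip(leaves_a, leaves_b))
    return different / len(leaves_a)


def coregulation_score(dissimilarity):
    """Co-regulation score as defined above: 1 - dissimilarity."""
    return 1.0 - dissimilarity


# Hypothetical leaf assignments across four trees (not real ProteomeHD data).
protein_a = [3, 1, 4, 2]
protein_b = [3, 1, 5, 2]

dissimilarity = leaf_dissimilarity(protein_a, protein_b)
score = coregulation_score(dissimilarity)
print(dissimilarity, score)
```

In this toy case the two proteins share a leaf in three of four trees, so they receive a high co-regulation score.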
- Why modify the co-regulation score cut-off?
- We use the treeClust algorithm with Topological Overlap Measure (TOM) to determine a co-regulation score for all possible pairs of proteins. This original treeClust+TOM score is difficult to interpret intuitively, because it covers a range between 0 and 0.25 and is highly skewed. We therefore turned our co-regulation score into a "percentile score". This means that, for example, a score of 0.995 corresponds to the strongest 0.5% of all co-regulated pairs across the entire dataset. We recommend starting with the default cut-off of 0.995 and decreasing it in steps of 0.01 if too few co-regulation partners are found (not all proteins have strong co-regulation partners). The original, non-percentile version of the treeClust+TOM co-regulation scores is available in the download section. You can also see the treeClust+TOM scores when you download the co-regulation partners of your query using the button below our co-regulation map.
Note that some restrictions apply: the interactive plots on the website display up to 1,000 co-regulation partners (with a warning), but the full set can be downloaded as a CSV file using a score cut-off of 0.0. A maximum of 100 proteins can be transferred to STRING and subjected to GO or KEGG analysis (the 100 with the highest scores are automatically selected).
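The percentile conversion described in this answer can be sketched roughly as follows. This is a minimal illustration that assumes a simple rank-based definition (the fraction of all pairs whose raw score is less than or equal to a given score); the exact procedure and tie handling used for ProteomeHD are not stated here, and the function name and raw values are hypothetical.

```python
from bisect import bisect_right


def percentile_scores(raw_scores):
    """Map each raw score to the fraction of scores that are <= it."""
    ordered = sorted(raw_scores)
    n = len(ordered)
    # bisect_right counts how many sorted values are <= the query value.
    return [bisect_right(ordered, s) / n for s in raw_scores]


# Hypothetical raw treeClust+TOM scores for four protein pairs.
raw = [0.01, 0.20, 0.05, 0.10]
percentiles = percentile_scores(raw)

# Keep only pairs at or above a percentile cut-off (assumed inclusive).
cutoff = 0.75
selected = [r for r, p in zip(raw, percentiles) if p >= cutoff]
print(percentiles, selected)
```

Under this definition, a percentile cut-off of 0.995 keeps roughly the strongest 0.5% of pairs regardless of how skewed the raw scores are.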
- What is the proteome co-regulation map?
- We determine a co-regulation score for all possible pairs of proteins. In other words, for every protein we measure how strongly (or weakly) it is co-regulated with every other protein. These are the data you can download; they are displayed in the tables and form the basis of all downstream analyses. However, this co-regulation dataset is also very complex, so we visualise it through t-Distributed Stochastic Neighbor Embedding (t-SNE). t-SNE is a technique for dimensionality reduction, which captures relationships between proteins in the high-dimensional co-regulation dataset and preserves them in a two-dimensional map. The more similarly two proteins behave across ProteomeHD, the closer together they are plotted in the t-SNE map. In this way, complex relationships between thousands of proteins can be visualised in a simple, human-readable 2D plot.
- Why are some co-regulated proteins far apart on the map?
- In short, a protein may be far from its co-regulation partner if it is more strongly co-regulated with a different set of proteins. The co-regulation map is a two-dimensional representation of the complex (high-dimensional) co-regulation dataset. It is a simplification that enables us to show the general layout of the human proteome in a 2D plot. However, many proteins are multifunctional and may be partially co-regulated with proteins from distinct biological processes. In such cases the position in the 2D map needs to be a compromise, optimised by the t-SNE algorithm. In general, it is best to use the map as a visualization tool or to get a quick, general impression of a protein’s potential function.
For detailed functional annotation it is recommended to explore the actual pairwise co-regulation scores.
- Why are not all proteins covered by the map?
- At the moment we restrict the co-regulation analysis to proteins which have been observed in at least 95 SILAC experiments, in order to increase robustness and accuracy.
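A coverage filter of this kind might look like the sketch below. The data layout (a dictionary mapping each protein to a list of SILAC ratios, with None for missing values) and all names are assumptions made for illustration; only the threshold of 95 experiments comes from the answer above.

```python
MIN_EXPERIMENTS = 95  # threshold stated in the answer above


def count_observed(ratios):
    """Number of experiments in which a protein was quantified."""
    return sum(r is not None for r in ratios)


def proteins_for_map(matrix, min_experiments=MIN_EXPERIMENTS):
    """Return proteins observed in at least min_experiments experiments."""
    return [
        protein
        for protein, ratios in matrix.items()
        if count_observed(ratios) >= min_experiments
    ]


# Toy matrix with 294 conditions per protein (values are placeholders).
toy_matrix = {
    "PROT_A": [0.1] * 95 + [None] * 199,  # exactly at the threshold
    "PROT_B": [0.1] * 94 + [None] * 200,  # one observation short
}
print(proteins_for_map(toy_matrix))
```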
- Does the Gene Ontology and KEGG enrichment take into account the selected score cut-off?
- Yes, when the protein or score cut-off is changed, GO and KEGG enrichment analysis is repeated. Only the top 100 proteins will be used for the analysis. This is an upper cap and not a fixed number. This means that if your protein list has 10 members, only those 10 will be used for enrichment analysis.
- Can I link to a particular protein and cut-off?
- Yes, the “Copy shareable link to clipboard” button creates a link that contains both the protein ID and the currently selected score cut-off.
- Which organisms does ProteomeHD cover?
- Homo sapiens only. Other organisms may be added in the long-term future.
- Why does the STRING network not show all my co-regulated proteins?
- We transfer up to 100 proteins to STRING. If there are more than 100 co-regulated proteins at your chosen score cut-off, only the 100 highest scoring will be transferred to STRING.
However, you can download the entire list of co-regulation partners and search STRING manually at https://string-db.org
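The selection rule can be pictured with the short sketch below. It is an illustration only: the function name, the (protein, score) tuple layout and the inclusive cut-off comparison are assumptions, not a description of the website's code.

```python
def top_partners(partners, cutoff, limit=100):
    """Keep partners at or above the cut-off, then the highest-scoring `limit`.

    partners is a list of (protein_id, score) tuples. The limit is an upper
    cap: shorter lists are returned unchanged apart from sorting.
    """
    passing = [(protein, score) for protein, score in partners if score >= cutoff]
    passing.sort(key=lambda item: item[1], reverse=True)
    return passing[:limit]


# Hypothetical partner list with 150 entries and strictly increasing scores.
partners = [(f"PROT_{i}", i / 1000) for i in range(150)]

selected = top_partners(partners, cutoff=0.0)
print(len(selected), selected[0])
```

Because the limit is a cap rather than a fixed number, a list with only 10 partners above the cut-off yields 10 proteins.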
- Do you have any APIs for accessing the resources?
- We offer two separate API calls for direct linking to our resources.
You can link directly to a protein of interest with a given score cut-off by constructing a link:
Make sure that the exact UniProt accession is in our data. You can copy such a link into your clipboard by clicking a button on the query page.
We also offer a way to link to a specific interaction, i.e. a query protein and the given target with a score above the chosen cut-off.
For this you will have to create such a link:
At the moment, purely automated access (e.g. returning JSON) is not implemented.
- I'm constantly getting an error message: 'Error in the CSRF Token field - The CSRF tokens do not match'
- The most common cause is that cookies are disabled in your browser or blocked by an ad-blocker. This site requires cookies to function (for example, to remember your last protein search or score cut-off in your browser). Some ad-blockers and browser 'incognito' modes interfere with cookies and may cause issues with our tool.
- How can I cite ProteomeHD?
- You can cite our work published in Nature Biotechnology:
Kustatscher, G., Grabowski, P., Schrader, T.A. et al. Co-regulation map of the human proteome enables identification of protein functions. Nat Biotechnol 37, 1361–1371 (2019) doi:10.1038/s41587-019-0298-5
Link to the publication: https://www.nature.com/articles/s41587-019-0298-5
- Questions about Progulon Finder (supervised machine learning)
- Which Random Forest score cut-off do you use to define a “protein regulon”?
- From a statistical perspective, a Random Forest score above 0.5 signifies that a protein belongs to the “same class” as the uploaded protein list. In our case this would translate to being part of the same regulon. This should be taken as a rule of thumb only. We find it is often helpful to manually inspect the predictions and perhaps choose a biologically more relevant cut-off at a slightly higher (or lower) Random Forest score.
- How do you calculate the area under the precision-recall curve (AUPRC) and which value do you consider sufficient?
- ProgulonFinder calculates the area under the precision-recall curve based on cross-validated training proteins (learn more about the architecture of the workflow). An area under the curve (AUC) of 1 means that all uploaded training proteins could be perfectly separated from ~1,000 randomly chosen proteins. In practice, it is challenging to specify a good value for this parameter. As a rule of thumb, the closer this value is to that of a random classifier (which is also provided in the report), the worse the model performed. A value near the random baseline suggests that the selected training proteins do not have a strong co-regulation pattern in ProteomeHD.
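To make the comparison with a random classifier more concrete, the sketch below computes average precision, one common estimator of the area under the precision-recall curve, together with the random baseline, which for a precision-recall curve equals the fraction of positives. This is not necessarily how ProgulonFinder computes its value; the function names, labels and scores are hypothetical, and tied scores are not treated specially.

```python
def average_precision(labels, scores):
    """Average precision: mean of precision values at each true positive.

    labels are 1 (training protein) or 0 (random protein); scores are the
    classifier scores. Items are ranked by descending score.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda item: item[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


def random_baseline(labels):
    """Expected precision of a random classifier: the fraction of positives."""
    return sum(labels) / len(labels)


# Hypothetical cross-validated predictions.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

ap = average_precision(labels, scores)
baseline = random_baseline(labels)
print(ap, baseline)
```

A value well above the baseline indicates that the training proteins are ranked ahead of the random proteins; a value near the baseline indicates little separation.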
- Why does the result file contain multiple UniProt ACs for the same protein?
- In proteomics, proteins are identified by sequencing peptides, but normally only a few peptides of each protein are observed. If protein isoforms differ outside the observed region it is impossible to know which isoform was actually detected. Therefore, all isoforms that potentially fit the data are reported (separated by a semicolon).
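When processing the result file, such a semicolon-separated field can be split as in the sketch below. The function names and the example accessions are placeholders, and stripping the isoform suffix relies on the UniProt convention of writing isoforms as the accession followed by a hyphen and a number.

```python
def split_accessions(field):
    """Split a semicolon-separated accession field into a clean list."""
    return [ac.strip() for ac in field.split(";") if ac.strip()]


def base_accession(accession):
    """Drop an isoform suffix such as '-2', keeping the primary accession."""
    return accession.split("-", 1)[0]


# Placeholder field as it might appear in a result file.
field = "P12345;P12345-2;Q67890"

accessions = split_accessions(field)
unique_bases = sorted({base_accession(ac) for ac in accessions})
print(accessions, unique_bases)
```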
- Can I use specific protein isoforms for training?
- No. When uploading a protein list, all available protein isoforms will be used for training. It would be difficult to use specific isoforms and exclude others, because often it is not clear which isoform precisely was observed in the experiment. Proteomics experiments usually report a range of isoforms that could all fit the observed data. If your protein of interest has isoforms with large functional differences (and these isoforms are found in ProteomeHD), you will need to omit that protein from your training list.
- Is there a limit to how many predictions I can submit?
- Yes. Due to limited resources, each user is limited to 10 submissions per day.
- How can I cite the Progulon Finder tool?
- You can cite our work published in Molecular Systems Biology:
Kustatscher G, Hödl M, Rullmann E, Grabowski P, Fiagbedzi E, Groth A, Rappsilber J. Higher-order modular regulation of the human proteome. Mol Syst Biol. 2023 Mar 9:e9503
Link to the publication: https://www.embopress.org/doi/full/10.15252/msb.20209503