Frequently Asked Questions
- Questions about ProteomeHD (unsupervised machine learning)
- What is ProteomeHD?
- ProteomeHD is a database of protein abundance changes in response to biological perturbations.
It is a data matrix for functional proteomics: proteins that are up- or downregulated to a similar extent under the same biological conditions probably have related cellular functions.
ProteomeHD differs from other drafts of the human proteome in that it does not catalogue the proteome of specific tissues or subcellular compartments. Instead, ProteomeHD catalogues the transitions between different proteome states, i.e. changes in protein abundance or localization resulting from cellular perturbations. HD, or high-definition, refers to two aspects of the dataset. First, quantitation accuracy: all experiments are quantified using SILAC (stable isotope labelling by amino acids in cell culture). Second, HD refers to the number of observations ("pixels") available for each protein. As more perturbations are analysed, regulatory patterns become more refined and can be compared more accurately.
- How many proteins and conditions does ProteomeHD cover?
- ProteomeHD v1.0 contains 10,323 human proteins and 294 biological conditions. Not every protein has been detected in every experiment. On average, there are 112 SILAC measurements for each protein.
- How do you measure protein co-regulation in ProteomeHD?
- We use the unsupervised machine-learning algorithm treeClust from Buttrey and Whitaker. For all possible pairs of proteins, it determines how similar their abundance changes are across ProteomeHD. We find treeClust to be a strong improvement over the more commonly used Pearson correlation, in both sensitivity and robustness.
The treeClust algorithm uses decision trees that can handle missing values, which are common in shotgun proteomics data. It outputs a dissimilarity metric that reflects how often two proteins land in the same leaves of the decision trees. For any two proteins, this allows us to determine a “co-regulation score”, defined as (1 - treeClust dissimilarity).
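As an illustration only, the score definition above can be sketched in Python. The sketch assumes that pairwise treeClust dissimilarities have already been computed elsewhere and are held in a dictionary keyed by protein pairs. The function name, the dictionary layout, the placeholder protein IDs and the example values are all hypothetical and are not part of ProteomeHD.

```python
def coregulation_scores(dissimilarities):
    """Convert pairwise dissimilarities (0 to 1) into co-regulation scores.

    `dissimilarities` maps a (protein_a, protein_b) tuple to a dissimilarity.
    This layout is an assumption made for this sketch.
    """
    scores = {}
    for pair, dissimilarity in dissimilarities.items():
        if not 0.0 <= dissimilarity <= 1.0:
            raise ValueError(f"dissimilarity out of range for {pair}")
        # Co-regulation score = 1 - treeClust dissimilarity
        scores[pair] = 1.0 - dissimilarity
    return scores


# Made-up values for two placeholder protein pairs
example = {("PROT_A", "PROT_B"): 0.25, ("PROT_A", "PROT_C"): 0.75}
scores = coregulation_scores(example)
```

Here the pair with the lower dissimilarity ends up with the higher co-regulation score.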
- Why modify the co-regulation score cut-off?
We use the treeClust algorithm to determine a co-regulation score for all possible pairs of proteins. The co-regulation score ranges from 0 (completely unrelated) to 1 (perfect co-regulation). We define two proteins as “co-regulated” if their co-regulation score is above 0.5. However, this is an arbitrary cut-off. For some proteins with many co-regulation partners, you may wish to increase the cut-off, in order to focus only on the most strongly co-regulated proteins. Other proteins may not be co-regulated strongly with any other proteins. In such cases, lowering the cut-off may be necessary to bring up functional clues. In practice, we recommend trying cut-offs between 0.4 and 0.7.
Note that some restrictions apply: the interactive plots on the website display up to 1,000 co-regulation partners (with a warning), but the full set can be downloaded as a CSV file using a score cut-off of 0.0. A maximum of 100 proteins can be transferred to STRING and subjected to GO or KEGG analysis (the 100 with the highest scores are automatically selected).
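For users working with a downloaded list of scores, the effect of the cut-off and of the 100-protein cap can be sketched as follows. This is a minimal illustration, not the website's actual code: the function name, the dictionary of partner scores and the placeholder IDs are hypothetical, and treating "above the cut-off" as a strict comparison is an assumption.

```python
def select_partners(partner_scores, cutoff=0.5, max_partners=100):
    """Return (protein, score) pairs above the cut-off, highest score first.

    `partner_scores` maps a partner protein ID to its co-regulation score
    with the query protein (layout assumed for this sketch).
    """
    # Assumption: "above the cut-off" is read as strictly greater than
    above = [(p, s) for p, s in partner_scores.items() if s > cutoff]
    above.sort(key=lambda item: item[1], reverse=True)
    # The cap is an upper limit: shorter lists are returned unchanged
    return above[:max_partners]


# Made-up scores for placeholder partner proteins
partner_scores = {"PROT_A": 0.9, "PROT_B": 0.55, "PROT_C": 0.45, "PROT_D": 0.2}
default_hits = select_partners(partner_scores)
relaxed_hits = select_partners(partner_scores, cutoff=0.4)
```

Lowering the cut-off from 0.5 to 0.4 adds one more partner in this example, which mirrors the recommendation to try several cut-offs for weakly co-regulated proteins.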
- What is the proteome co-regulation map?
- We determine a co-regulation score for all possible pairs of proteins. In other words, for every protein we measure how strongly - or weakly - it is co-regulated with any other protein. These are the data you can download; they are displayed in the tables and form the basis of all downstream analyses. However, this co-regulation dataset is also very complex, so we visualise it through t-Distributed Stochastic Neighbor Embedding (t-SNE). t-SNE is a dimensionality reduction technique that captures relationships between proteins in the high-dimensional co-regulation dataset and preserves them in a two-dimensional map. The more similarly two proteins behave across ProteomeHD, the closer together they are plotted in the t-SNE map. In this way, complex relationships between thousands of proteins can be visualised in a simple, human-readable 2D plot.
- Why are some co-regulated proteins far apart on the map?
In short, a protein may be far apart from its co-regulation partner if it is more strongly co-regulated with a different set of proteins. The co-regulation map is a two-dimensional representation of the complex (high-dimensional) co-regulation dataset. It is a simplification that enables us to show the general layout of the human proteome in a 2D plot. However, many proteins are multifunctional and may be partially co-regulated with proteins from distinct biological processes. In such cases the position in the 2D map needs to be a compromise, optimised by the t-SNE algorithm. In general, it is best to use the map as a visualization tool or to get a quick, general impression of a protein’s potential function.
For detailed functional annotation it is recommended to explore the actual pairwise co-regulation scores.
- Why are not all proteins covered by the map?
- At the moment we restrict the co-regulation analysis to proteins which have been observed in at least 95 SILAC experiments, in order to increase robustness and accuracy.
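The coverage filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the data layout (one list of SILAC ratios per protein, with None marking experiments in which the protein was not detected) are hypothetical, as are the placeholder protein IDs and values.

```python
def proteins_with_enough_data(ratios, min_observations=95):
    """Keep proteins observed in at least `min_observations` experiments.

    `ratios` maps a protein ID to a list with one entry per experiment;
    None stands for a missing value (layout assumed for this sketch).
    """
    kept = []
    for protein, values in ratios.items():
        observed = sum(1 for value in values if value is not None)
        if observed >= min_observations:
            kept.append(protein)
    return kept


# Two placeholder proteins across 294 conditions, with made-up ratios
ratios = {
    "PROT_A": [0.1] * 100 + [None] * 194,  # observed 100 times, kept
    "PROT_B": [0.1] * 94 + [None] * 200,   # observed 94 times, dropped
}
kept = proteins_with_enough_data(ratios)
```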
- Does the Gene Ontology and KEGG enrichment take into account the selected score cut-off?
- Yes, when the protein or score cut-off is changed, GO and KEGG enrichment analysis is repeated. Only the top 100 proteins will be used for the analysis. This is an upper cap and not a fixed number. This means that if your protein list has 10 members, only those 10 will be used for enrichment analysis.
- Can I link to a particular protein and cut-off?
- Yes, the “Copy shareable link to clipboard” button creates a link that contains both the protein ID and the currently selected score cut-off.
- Which organisms does ProteomeHD cover?
- Homo sapiens only. Other organisms may be added in the long-term future.
- Why does the STRING network not show all my co-regulated proteins?
We transfer up to 100 proteins to STRING. If there are more than 100 co-regulated proteins at your chosen score cut-off, only the 100 highest-scoring proteins will be transferred to STRING.
However, you can download the entire list of co-regulation partners and search STRING manually at https://string-db.org
- Do you have any APIs for accessing the resources?
We offer two separate API calls for direct linking to our resources.
You can link directly to a protein of interest with a given score cut-off by constructing a link:
Make sure that the exact UniProt accession is in our data. You can copy such a link to your clipboard by clicking a button on the query page.
We also offer a way to link to a specific interaction, i.e. a query protein and the given target with a score above the chosen cut-off.
For this you will have to create such a link:
At the moment, purely automated access (e.g. returning JSON) is not implemented.
- How can I cite ProteomeHD?
- You can’t yet. But please check here again soon, we are working on it!