Any questions?

What is HapScoreDB?

HapScoreDB is a proteogenomic database that collects Protein Language Model (PLM) scores for haplotype-resolved protein-coding sequences across all human transcript isoforms. It integrates GENCODE and Ensembl gene and transcript models with phased variant data from the 1000 Genomes Project. For each protein-coding transcript, we reconstructed the sequences of common protein haplotypes containing single or multiple variants, including both SNPs (single nucleotide polymorphisms) and INDELs (insertions/deletions). To quantify functional impact, we scored each protein haplotype with protein language models such as ESM2, ProSST, and PoET.

What is a Protein Language Model?

A protein language model is a probabilistic computational model designed to capture the statistical patterns of protein sequences. Its purpose is to learn a high-dimensional representation of the sequence space that corresponds to functional, evolutionarily viable proteins. To achieve this, the model is trained on large biological sequence databases with the same techniques used in Natural Language Processing. A predominant self-supervised strategy is masked language modeling, where the model learns to predict masked amino acids from their surrounding context. Through this and similar tasks, the model implicitly learns the patterns underlying protein structure and function, including the co-evolutionary constraints between residue positions, information that is otherwise derived explicitly from the analysis of a Multiple Sequence Alignment (MSA).

Once trained, the model can evaluate an arbitrary protein sequence by assigning it a quantitative score. We compute this score as the pseudo-log-likelihood of the sequence: the average, over all positions, of the log conditional probability of each amino acid given the context of the entire sequence. A higher pseudo-log-likelihood indicates that the sequence is more probable under the learned distribution, and therefore more consistent with known evolutionary and structural constraints.

How are the scores calculated?

For ESM2 and ProSST, each protein sequence is passed to the model once, without masking any amino acid, and we compute its predicted pseudo-log-likelihood following Brandes et al. 2023 (doi.org/10.1038/s41588-023-01465-0):

$$ P_{PLL}(S) = \frac{1}{L} \sum_{i=1}^{L} \log P(r_i = r_i^S \mid S) $$

where $S$ is the input protein sequence, $L$ its length, and $r_i^S$ the amino acid at position $i$ in $S$. For masked language models, we find that this score correlates well with experimental data from the ProteinGym benchmark in the zero-shot setting. Although it performs slightly worse on single-variant effect prediction than the commonly used wild-type marginals method, this type of score can capture the effect of indels and of epistatic interactions when multiple variants are present in the same sequence.
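The averaging step in the formula above can be sketched in a few lines of Python. This is a minimal illustration, not the HapScoreDB pipeline: the function name `pseudo_log_likelihood` and the `position_probs` input (one dictionary of amino acid probabilities per position, as a model might predict from the full unmasked sequence) are assumptions made for the example, and the probabilities are toy values.

```python
import math


def pseudo_log_likelihood(sequence, position_probs):
    """Average log-probability of the observed residue at each position.

    position_probs[i] is a hypothetical dict mapping amino acid -> probability
    predicted at position i when the whole sequence is given as context.
    """
    if not sequence or len(sequence) != len(position_probs):
        raise ValueError("sequence and position_probs must be non-empty and equally long")
    total = sum(math.log(position_probs[i][aa]) for i, aa in enumerate(sequence))
    return total / len(sequence)


# Toy two-residue example with made-up probabilities.
toy_probs = [{"M": 0.9, "A": 0.1}, {"M": 0.2, "A": 0.8}]
pll = pseudo_log_likelihood("MA", toy_probs)
print(pll)
```

Because the sum is divided by the sequence length, scores of sequences with different lengths (for example after an indel or a truncation) stay on a comparable scale.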

For ProSST, the input structures are taken from AlphaFoldDB (Varadi et al. 2024, doi.org/10.1093/nar/gkad1011) and encoded using the version of the structure sequence alphabet with 4096 tokens. For PoET, an autoregressive generative model, we use the scoring function of the original paper, averaging over multiple context lengths as in that work. Its inputs are MSAs generated from the UniRef100 database with the ColabFold protocol.

What do PLLR_wt and PLLR_mf mean?

The Pseudo Log-Likelihood (PLL) of each haplotype-transcript pair should be compared to a reference to quantify how much a given haplotype shifts the score of the transcript. To do so, we adopted the Pseudo Log-Likelihood Ratio (PLLR), a metric already established in the literature. Since we are handling log-likelihoods, the ratio is the difference between two PLLs. For each score, we computed its ratio with respect to the score of the most frequent haplotype of the transcript (PLLR_mf) and with respect to the sequence annotated as wild type in Ensembl (PLLR_wt). In most cases these two ratios coincide.
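As a worked illustration of the two ratios, the sketch below computes them for three haplotypes of one transcript. The record layout (`id`, `pll`, `frequency`, `is_wild_type`), the toy values, and the sign convention (haplotype PLL minus reference PLL) are assumptions made for the example, not the documented database schema.

```python
def pllr(pll_haplotype, pll_reference):
    """Log-likelihood ratio expressed as a difference of two PLLs."""
    return pll_haplotype - pll_reference


# Toy haplotypes of a single transcript (made-up values).
haplotypes = [
    {"id": "hap_wt", "pll": -1.20, "frequency": 0.30, "is_wild_type": True},
    {"id": "hap_a", "pll": -1.35, "frequency": 0.60, "is_wild_type": False},
    {"id": "hap_b", "pll": -1.10, "frequency": 0.10, "is_wild_type": False},
]

# Reference scores: the most frequent haplotype and the annotated wild type.
most_frequent = max(haplotypes, key=lambda h: h["frequency"])
wild_type = next(h for h in haplotypes if h["is_wild_type"])

ratios = {
    h["id"]: {
        "PLLR_mf": pllr(h["pll"], most_frequent["pll"]),
        "PLLR_wt": pllr(h["pll"], wild_type["pll"]),
    }
    for h in haplotypes
}
print(ratios)
```

In this toy case the wild type is not the most frequent haplotype, so the two ratios differ; when the wild type is also the most frequent haplotype they are identical.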

What do score deltas represent?

Score deltas are transcript-level metrics: all entries in the table sharing the same transcript ID share the same score delta (one per model). The score delta is computed as the difference between the maximum and the minimum score obtained by any haplotype of that transcript. Comparing the score delta of a transcript against the distribution of all score deltas gives a quantitative measure of the functional variability of that transcript in the population.
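The max-minus-min definition can be sketched as follows for a single model. The transcript IDs and scores are placeholders, and the `(transcript_id, score)` row layout is an assumption made for the example.

```python
from collections import defaultdict


def score_deltas(rows):
    """Return {transcript_id: max score - min score} over its haplotypes."""
    scores_by_transcript = defaultdict(list)
    for transcript_id, score in rows:
        scores_by_transcript[transcript_id].append(score)
    return {tx: max(scores) - min(scores) for tx, scores in scores_by_transcript.items()}


# Placeholder rows: one (transcript_id, score) pair per haplotype.
rows = [("TX_A", -1.2), ("TX_A", -1.5), ("TX_A", -1.3), ("TX_B", -0.9)]
deltas = score_deltas(rows)
print(deltas)
```

A transcript with a single haplotype has a delta of zero, since its maximum and minimum scores coincide.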

How were protein sequences generated?

For each transcript isoform, all haplotypes comprising common variants were identified. For each of these haplotypes, a mutated sequence was generated by introducing the haplotype’s variants into the wild-type coding sequence. When the translation start site (start codon) is lost due to variants, the next available start codon is used as the translation start site. When a nonsense mutation occurs, the protein is truncated at that point.
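The three rules above (apply the variants, fall back to the next start codon, truncate at a stop codon) can be sketched as below. This is a simplified illustration under stated assumptions, not the HapScoreDB implementation: variants are given as hypothetical `(position, ref, alt)` tuples with 0-based positions on the coding sequence, they do not overlap, the sequence contains only A, C, G, T, and "next start codon" is taken to mean the first downstream ATG regardless of reading frame.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def apply_variants(cds, variants):
    """Insert a haplotype's variants into the wild-type coding sequence."""
    seq = cds
    # Apply from the 3' end so earlier coordinates stay valid after indels.
    for pos, ref, alt in sorted(variants, reverse=True):
        if seq[pos:pos + len(ref)] != ref:
            raise ValueError(f"reference allele mismatch at position {pos}")
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq


def translate(cds):
    """Translate from the first ATG found, stopping at the first stop codon."""
    start = cds.find("ATG")  # index 0 when the original start codon is intact
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":  # nonsense or natural stop: truncate here
            break
        protein.append(amino_acid)
    return "".join(protein)


wild_type_cds = "ATGGCTAAATGGTGA"
reference_protein = translate(wild_type_cds)
nonsense_protein = translate(apply_variants(wild_type_cds, [(6, "A", "T")]))
deletion_protein = translate(apply_variants(wild_type_cds, [(3, "GCT", "")]))
start_loss_protein = translate(apply_variants(wild_type_cds, [(2, "G", "C")]))
print(reference_protein, nonsense_protein, deletion_protein, start_loss_protein)
```

The toy coding sequence translates to `MAKW`; the SNP at position 6 creates a stop codon and truncates the protein, the 3-base deletion removes one residue, and the start-loss variant shifts translation to the downstream ATG.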

What types of genes and transcripts are taken into consideration?

Gene and transcript annotations are downloaded from Ensembl. In total, 18741 protein-coding genes and all their isoforms, amounting to 78282 transcripts, were downloaded. Transcripts longer than 4000 amino acids were removed due to model and computational limitations, while transcripts shorter than 10 amino acids were removed due to

Where do genotype data come from?

Genotype data are taken from the phased 1000 Genomes Project phase 3 release, aligned to the reference genome hg38. Lowy-Gallego E, Fairley S, Zheng-Bradley X et al. Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project. Wellcome Open Res 2019, 4:50 (https://doi.org/10.12688/wellcomeopenres.15126.2)

Which variants were taken into consideration?

To generate the database of mutated sequences, all phased common (MAF > 0.5%) coding variants identified in 1000 Genomes, both SNPs and INDELs, were taken into consideration.
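The frequency threshold can be illustrated with a short sketch. It assumes a hypothetical input format in which each variant is described by one 0/1 alternate-allele indicator per phased haplotype; how the actual pipeline reads genotypes and computes frequencies is not described here.

```python
MAF_THRESHOLD = 0.005  # 0.5%, the cutoff stated above


def minor_allele_frequency(haplotype_alleles):
    """Frequency of the less common allele among phased haplotypes (0 = ref, 1 = alt)."""
    alt_frequency = sum(haplotype_alleles) / len(haplotype_alleles)
    return min(alt_frequency, 1 - alt_frequency)


def is_common(haplotype_alleles, threshold=MAF_THRESHOLD):
    """True when the variant passes the MAF > threshold filter."""
    return minor_allele_frequency(haplotype_alleles) > threshold


# Toy data: alt allele carried by 1 of 100 haplotypes vs 1 of 1000 haplotypes.
kept = is_common([1] + [0] * 99)
dropped = is_common([1] + [0] * 999)
print(kept, dropped)
```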

Columns description

Version

License

All data and download files in HapScoreDB are freely available under a 'Creative Commons BY 4.0' license. When using the data, please provide appropriate credit and inform users of any changes or additions that you have made to the data.

API

HapScoreDB data are also available through a REST API at the endpoint:

https://bcglab.cibio.unitn.it/hapscoredbAPI/data

with the following parameters:

Example: curl "https://bcglab.cibio.unitn.it/hapscoredbAPI/data?genes=ENSG00000164002,ENSG00000065978"
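The same request can be issued from Python using only the standard library. This is a minimal sketch: the endpoint and the `genes` parameter come from the curl example above, while the helper names `build_query_url` and `fetch_raw`, the timeout, and the assumption that the response is UTF-8 text are illustrative choices. The response format is not described here, so the body is returned unparsed.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://bcglab.cibio.unitn.it/hapscoredbAPI/data"


def build_query_url(gene_ids):
    """Build the request URL for a list of Ensembl gene IDs."""
    # Keep commas unescaped so the URL matches the curl example.
    query = urlencode({"genes": ",".join(gene_ids)}, safe=",")
    return f"{API_URL}?{query}"


def fetch_raw(gene_ids, timeout=30):
    """Send the request and return the response body as text (assumed UTF-8)."""
    with urlopen(build_query_url(gene_ids), timeout=timeout) as response:
        return response.read().decode("utf-8")


url = build_query_url(["ENSG00000164002", "ENSG00000065978"])
print(url)
```

Calling `fetch_raw(["ENSG00000164002", "ENSG00000065978"])` performs the actual network request; it is left out of the example so the snippet runs offline.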

Cite

If you use HapScoreDB, please cite the original publication:

Fabio Mazza, Filippo Gastaldello, Davide Dalfovo, Gianluca Lattanzi, Alessandro Romanel,
HapScoreDB: a database of protein language model functional scores for haplotype-resolved protein sequences, Nucleic Acids Research, 2025
https://doi.org/10.1093/nar/gkaf1184
