For ESM2 and ProSST, we compute the predicted pseudo-log-likelihood of every protein sequence in a single forward pass through the model, without masking any amino acid, following Brandes et al. 2023 (doi.org/10.1038/s41588-023-01465-0):
$$ P_{PLL}(S) = \frac{1}{L} \sum_{i=1}^{L} \log P(r_i = r_i^S \mid S) $$
where $S$ is the input protein sequence (passed unmasked to the model), $L$ is its length and $r_i^S$ is the amino acid at position $i$ in $S$. For masked language models, we find that this score correlates well with experimental data from the ProteinGym benchmark in the zero-shot setting. Although it performs slightly worse than the commonly used wild-type marginals method for single-variant effect prediction, this type of score can capture the effects of indels and of epistatic interactions when multiple variants are present in the same sequence.
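The averaging in the formula above can be illustrated with a minimal Python sketch. It assumes the model output is already available as one vector of logits per sequence position; the function names, the `AMINO_ACIDS` column order and the input layout are hypothetical choices made for illustration and are not taken from the ESM2 or ProSST code.

```python
import math

# Assumed column order of the model's output logits (hypothetical).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_softmax(logits):
    """Numerically stable log-softmax over the logits of one position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def pseudo_log_likelihood(sequence, logits, alphabet=AMINO_ACIDS):
    """Average log-probability of the observed residue at each position.

    sequence: the protein sequence S, as a string of length L.
    logits:   L lists of per-residue logits, one list per position,
              obtained from a single unmasked forward pass.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    if len(sequence) != len(logits):
        raise ValueError("need exactly one logit vector per position")
    index = {aa: j for j, aa in enumerate(alphabet)}
    total = 0.0
    for residue, position_logits in zip(sequence, logits):
        # log P(r_i = r_i^S | S) for the residue actually present at i
        total += log_softmax(position_logits)[index[residue]]
    return total / len(sequence)
```

Because the sum is divided by $L$, sequences of different lengths (for example haplotypes carrying indels) yield scores on a comparable per-residue scale.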
For ProSST, the input structures are taken from AlphaFoldDB (Varadi et al. 2024, doi.org/10.1093/nar/gkad1011) and encoded using the 4096-token version of the structure-sequence alphabet. For PoET, an autoregressive generative model, we use the scoring function of the original paper, averaging over multiple context lengths as in that work, and using as input MSAs generated from the UniRef100 database with the ColabFold protocol.
curl "https://bcglab.cibio.unitn.it/hapscoredbAPI/data?genes=ENSG00000164002,ENSG00000065978"
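The same request can be issued from Python with the standard library only. This is a sketch under assumptions: the endpoint and the `genes` parameter are copied from the curl command above, while the helper names `build_query_url` and `fetch_scores` are hypothetical. The response format is not described here, so the body is returned as raw text rather than parsed.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example above.
BASE_URL = "https://bcglab.cibio.unitn.it/hapscoredbAPI/data"


def build_query_url(gene_ids):
    """Build the query URL for a list of Ensembl gene IDs."""
    # Gene IDs are comma-separated in a single `genes` parameter;
    # safe="," keeps the commas unescaped, as in the curl example.
    query = urlencode({"genes": ",".join(gene_ids)}, safe=",")
    return BASE_URL + "?" + query


def fetch_scores(gene_ids, timeout=30):
    """Send the request and return the response body as text.

    No assumption is made about the response format (hypothetical helper).
    """
    with urlopen(build_query_url(gene_ids), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_scores(["ENSG00000164002", "ENSG00000065978"])` sends the same request as the curl command.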