zlacker

[return to "AlphaFold reveals the structure of the protein universe"]
1. COGlor+JD[view] [source] 2022-07-28 15:03:35
>>MindGo+(OP)
Before my comment gets dismissed, I will disclose that I am a professional structural biologist who works in this field every day.

These threads are always the same: lots of comments about protein folding, how amazing DeepMind is, how AlphaFold is a success story, how it has flipped an entire field on its head, etc. The language from Google is so deceptive about what they've actually done that I think it's intentionally disingenuous.

At the end of the day, AlphaFold is amazing homology modeling. I love it, I think it's an awesome application of machine learning, and I use it frequently. But it's doing the same thing we've been doing for 2 decades: pattern matching sequences of proteins with unknown structure to sequences of proteins with known structure, and about 2x as well as we used to be able to.

That's extremely useful, but it's not knowledge of protein folding. It can't predict a fold de novo, it can't predict folds that haven't been seen (EDIT: this is maybe not strictly true, depending on how you slice it), it fails in a number of edge cases (remember, in biology, edge cases are everything) and again, I can't stress this enough, we have no new information on how proteins fold. We know all the information (or most of it, at least) for a protein's final fold is in the sequence. But we don't know much about the in-between.

I like AlphaFold, it's convenient and I use it (although for anything serious or anything interacting with anything else, I still need a real structure), but I feel as though it has been intentionally and deceptively oversold. There are 3-4 other deep learning projects I think have had a much greater impact on my field.

EDIT: See below: https://news.ycombinator.com/item?id=32265662 for information on predicting new folds.

◧◩
2. flobos+lG[view] [source] 2022-07-28 15:17:29
>>COGlor+JD
> AlphaFold is amazing homology modeling

If it is homology modelling, then how can it work without input template structures?

◧◩◪
3. COGlor+BJ[view] [source] 2022-07-28 15:30:08
>>flobos+lG
It has template structures. AlphaFold uses the following databases:

    BFD,
    MGnify,
    PDB70,
    PDB (structures in the mmCIF format),
    PDB seqres – only for AlphaFold-Multimer,
    Uniclust30,
    UniProt – only for AlphaFold-Multimer,
    UniRef90.
◧◩◪◨
4. flobos+kK[view] [source] 2022-07-28 15:33:10
>>COGlor+BJ
Those databases are used to derive the evolutionary couplings and distance matrices used by the algorithm. Several of those databases aren’t even structural ones. Furthermore, AlphaFold can function with only an MSA as an input, without retrieving a single PDB coordinate.
◧◩◪◨⬒
5. COGlor+ZL[view] [source] 2022-07-28 15:40:09
>>flobos+kK
It's all about boosting signal by finding other proteins that are similar, until you get to the point that you can identify a fold to assign to a region of the protein. That's why some are structural, and some are not.

>Furthermore, AlphaFold can function with only an MSA as an input, without retrieving a single PDB coordinate.

Yes, it has a very nice model of what sequences should look like in 3D. That model is derived from experimental data. So if I give AlphaFold an MSA of a new, unknown protein fold (substantively away from any known fold), it cannot predict it.

◧◩◪◨⬒⬓
6. flobos+sO[view] [source] 2022-07-28 15:50:30
>>COGlor+ZL
> Yes, it has a very nice model of what sequences should look like in 3D.

A structural model, you would say.

> That model is derived from experimental data.

That doesn’t make it a template-based model, or a homology one.

> if I give AlphaFold an MSA of a new, unknown protein fold (substantively away from any known fold), it cannot predict it

That will depend on the number of effective sequences found to derive couplings. Domains with novel folds usually have a low number of remotely homologous sequences, and it is for that reason that the method will fail, not just because they are novel.
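
To make "number of effective sequences" concrete, here is a minimal sketch of one common way to estimate it from an alignment: each sequence is down-weighted by the number of near-duplicates it has, and the weights are summed. The 80% clustering threshold and the function names are assumptions chosen for illustration; this is not AlphaFold's own code.

```python
def fractional_identity(a, b):
    """Fraction of aligned columns with the same residue (gap-gap columns are skipped)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def effective_sequences(msa, cluster_threshold=0.8):
    """Sum of per-sequence weights 1 / (size of the sequence's identity cluster).

    cluster_threshold is an assumed value, not one taken from the thread.
    """
    total = 0.0
    for i, seq in enumerate(msa):
        # Every sequence counts itself, so the cluster size is never zero.
        cluster_size = 1 + sum(
            1
            for j, other in enumerate(msa)
            if j != i and fractional_identity(seq, other) >= cluster_threshold
        )
        total += 1.0 / cluster_size
    return total


# Toy alignment: two identical rows, one close variant, one unrelated row.
toy_msa = ["ACDE", "ACDE", "ACDF", "WWWW"]
print(effective_sequences(toy_msa))
```

A deep alignment made of near-identical sequences scores low on this measure, which is why raw sequence count alone says little about how much coupling signal is available.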

◧◩◪◨⬒⬓⬔
7. COGlor+o71[view] [source] 2022-07-28 17:05:11
>>flobos+sO
>Domains with novel folds usually have a low number of remotely homologous sequences, and it is for that reason that the method will fail, not just because they are novel.

How can you say this but not believe it's doing homology modeling?

◧◩◪◨⬒⬓⬔⧯
8. flobos+yb1[view] [source] 2022-07-28 17:19:08
>>COGlor+o71
Because homology search is not homology modelling. And a multiple sequence alignment is not a structural (i.e., with three-dimensional coordinates) template.
◧◩◪◨⬒⬓⬔⧯▣
9. ssivar+zq1[view] [source] 2022-07-28 18:26:48
>>flobos+yb1
For someone who knows very little about this field, could you elaborate on what specific aspect of “homology modeling” AF violates/circumvents which makes you call it “homology search” instead?
◧◩◪◨⬒⬓⬔⧯▣▦
10. flobos+Aw1[view] [source] 2022-07-28 18:59:33
>>ssivar+zq1
Homology search is a method to find homologous sequences, that is, evolutionarily related sequences that possess a common ancestor. This was usually done based on how identical sequences were, but newer algorithms make it possible to find remote homologs even when the identity between the sequences is very low. The first step in AlphaFold is to retrieve as many remotely homologous sequences as possible to generate a multiple sequence alignment (MSA) that will be used to generate the embedding.

On the other hand, homology (or comparative) modelling is a method that generates a structural model of a query sequence based on one or more experimentally solved structures of a close protein homolog. The model generation details depend on the specific protocol but, broadly speaking, spatial restraints are extracted from the template structures and mapped to the query sequence to be modelled.

Note that AlphaFold also uses a type of geometrical restraint (pairwise residue distances) in its modelling, although they are not derived from protein structures but from the MSA embeddings. The two are related but are not exactly the same.

One difference between AlphaFold and homology modelling is that the latter requires templates having a certain sequence identity with the query sequence (≥30% is the rule of thumb), while the former can have in its MSA remotely homologous sequences well below any discernible identity.
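
As a rough illustration of the template-based side of that contrast, the sketch below reads pairwise C-alpha distances off a template and maps them onto query residues through a pairwise alignment, refusing templates below the 30% rule of thumb. It is a toy under stated assumptions: the function names, the coordinate layout (one (x, y, z) tuple per template residue) and the hard identity cut-off are all hypothetical, and real comparative modelling protocols derive far richer restraints.

```python
import math


def aligned_pairs(query_aln, template_aln):
    """List (query_index, template_index, same_residue) for columns without gaps."""
    if len(query_aln) != len(template_aln):
        raise ValueError("alignment rows must have equal length")
    pairs = []
    qi = ti = 0  # running residue indices in the ungapped sequences
    for q, t in zip(query_aln, template_aln):
        if q != "-" and t != "-":
            pairs.append((qi, ti, q == t))
        if q != "-":
            qi += 1
        if t != "-":
            ti += 1
    return pairs


def template_restraints(query_aln, template_aln, template_ca, min_identity=0.30):
    """Map template C-alpha distances onto pairs of query residues.

    Returns {(query_i, query_j): distance}; empty if the template is too
    dissimilar (min_identity is the rule of thumb quoted in the thread).
    """
    pairs = aligned_pairs(query_aln, template_aln)
    if not pairs:
        return {}
    identity = sum(1 for _, _, same in pairs if same) / len(pairs)
    if identity < min_identity:
        return {}
    restraints = {}
    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            qi, ti, _ = pairs[a]
            qj, tj, _ = pairs[b]
            restraints[(qi, qj)] = math.dist(template_ca[ti], template_ca[tj])
    return restraints


# Toy template: five residues spaced 3 units apart along the x axis.
toy_ca = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0), (9.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
print(template_restraints("AC-DE", "ACWDF", toy_ca))
```

The point of the sketch is the input it needs: three-dimensional template coordinates. An MSA alone, however deep, gives this procedure nothing to work with.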
