zlacker

Before my comment gets dismissed, I will disclaim I am a professional structural biologist that works in this field every day.

These threads are always the same: lots of comments about protein folding, how amazing DeepMind is, how AlphaFold is a success story, how it has flipped an entire field on it's head, etc. The language from Google is so deceptive about what they've actually done, I think it's actually intentionally disingenuous.

At the end of the day, AlphaFold is amazing homology modeling. I love it, I think it's an awesome application of machine learning, and I use it frequently. But it's doing the same thing we've been doing for 2 decades: pattern matching sequences of proteins with unknown structure to sequences of proteins with known structure, and about 2x as well as we used to be able to.

That's extremely useful, but it's not knowledge of protein folding. It can't predict a fold de novo, it can't predict folds that haven't been seen (EDIT: this is maybe not strictly true, depending on how you slice it), it fails in a number of edge cases (remember, in biology, edge cases are everything) and again, I can't stress this enough, we have no new information on how proteins fold. We know all the information (most of at least) for a proteins final fold is in the sequence. But we don't know much about the in-between.

I like AlphaFold, it's convenient and I use it (although for anything serious or anything interacting with anything else, I still need a real structure), but I feel as though it has been intentionally and deceptively oversold. There are 3-4 other deep learning projects I think have had a much greater impact on my field.

EDIT: See below: https://news.ycombinator.com/item?id=32265662 for information on predicting new folds.

replies(16): >>mupuff+Y >>flobos+C2 >>Lloydk+c4 >>adamsm+S4 >>dekhn+x5 >>ramraj+F5 >>zack-m+If >>wrycod+Ay >>mellos+qH >>robbin+gK >>ehsank+UN >>uoaei+RO >>eachro+k31 >>simsla+O71 >>panabe+qB1 >>Protos+Gfe

>>COGlor+(OP)
> There are 3-4 other deep learning projects I think have had a much greater impact on my field.

Don't leave us hanging... which projects?

replies(1): >>COGlor+P5

>>COGlor+(OP)
> AlphaFold is amazing homology modeling

If it is homology modelling, then how can it work without input template structures?

replies(1): >>COGlor+S5

>>COGlor+(OP)
I mean like whats this about AlphaFold is gone

>>COGlor+(OP)
> it can't predict folds that haven't been seen

This seems strange to me. The entire point of these types of models is to predict things on unseen data. Are you saying Deepmind is completely lying about their model?

Deepmind solved CASP, isn't the entire point of that competition to predict unseen structures?

If AlphaFold doesn't predict anything then what are you using it to do?

replies(1): >>COGlor+06

>>COGlor+(OP)
I've directly communicated with the leaders of CASP and at DM that they should stop representing this as a form of protein folding and just call it "crystal/cryoEM structure prediction" (they filter out all the NMR structures from PDB since they aren't good for prediction). They know it's disingenuous and they do it on purpose to give it more impact than it really deserves.

I would like to correct somethign here- it does predict structures de novo and predict folds that haven't been seen before. That's because of the design of the NN- it uses sequence information to create structural constraints. If those constraints push the modeller in the direction of a novel fold, it will predict that.

To me what's important about this is that it demonstrated the obvious (I predicted this would happen eventually, shortly after losing CASP in 2000).

replies(1): >>COGlor+0a

>>COGlor+(OP)
Not sure if you should be reminded of how alpha fold started, it started by winning a competition thought un winnable by academics. Top labs working in protein structure prediction have fundamentally changed direction after alpha fold and are working to do the same even better.

This is not the first (or even tenth) time I’m seeing an academic trying to undermine genuine progress almost to the level of gaslighting. Comparing alphafold to conventional homology modeling is disingenuous at its most charitable interpretation.

Not sure what else to say. Structural biology has always been the weirdest field I’ve seen, the way students are abused (crystallize and publish in nature or go bust), and how every nature issue will have three structure papers as if that cures cancer every day. I suppose it warps one’s perception of outsiders after being in such a bubble?

signed, someone with a PhD in biomedical engineering, did a ton of bio work.

replies(4): >>COGlor+e7 >>shpong+3b >>stonog+z21 >>Gatsky+lp1

>>mupuff+Y
1) Isonet - takes low SNR cryo-electron tomography images (that are extremely dose limited, so just incredibly blurry and frequently useless) and does two things:

* Deconvolutes some image aberrations and "de-noises" the images

* Compensates for missing wedge artifacts (missing wedge is the fact that the tomography isn't done -90° --> +90°, but usually instead -60° --> +60°, leaving a 30° wedge on the top and bottom of basically no information) which usually are some sort of directionality in image density. So if you have a sphere, the top and bottom will be extremely noisy and stretched up and down (in Z).

https://www.biorxiv.org/content/10.1101/2021.07.17.452128v1

2) Topaz, but topaz really counts as 2 or 3 different algorithms. Topaz has denoising of tomograms and of flat micrographs (i.e. images taken with a microscope, as opposed to 3D tomogram volumes). That denoising is helpful because it increases contrast (which is the fundamental problem in Cryo-EM for looking at biomolecules). Topaz also has a deep learning particle picker which is good at finding views of your protein that are under-represented, or otherwise missing, which again, normally results in artifacts when you build your 3D structure.

https://emgweb.nysbc.org/topaz.html

3) EMAN2 convolutional neural network for tomogram segmentation/Amira CNN for segmentation/flavor of the week CNN for tomogram segmentation. Basically, we can get a 3D volume of a cell or virus or whatever, but then they are noisy. To do anything worthwhile with it, even after denoising, we have to say "this is cell membrane, this is virus, this is nucleic acid" etc. CNNs have proven to be substantially better at doing this (provided you have an adequate "ground truth") than most users.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5623144/

>>flobos+C2
It has template structures. AlphaFold uses the following databases:

    BFD,
    MGnify,
    PDB70,
    PDB (structures in the mmCIF format),
    PDB seqres – only for AlphaFold-Multimer,
    Uniclust30,
    UniProt – only for AlphaFold-Multimer,
    UniRef90.

replies(1): >>flobos+B6

>>adamsm+S4
AlphaFold figures out that my input sequence (which has no structural data) is similar to this other protein that has structural data. Or maybe different parts of different proteins. It does this extremely well.

replies(1): >>flobos+27

>>COGlor+S5
Those databases are used to derive the evolutionary couplings and distance matrices used by the algorithm. Several of those databases aren’t even structural ones. Furthermore, AlphaFold can function with only a MSA as an input, without retrieving a single PDB coordinate.

replies(1): >>COGlor+g8

>>COGlor+06
This is a gross misrepresentation of the method.

replies(1): >>COGlor+n8

>>ramraj+F5
> Not sure if you should be reminded of how alpha fold started, it started by winning a competition thought un winnable by academics. Top labs working in protein structure prediction have fundamentally changed direction after alpha fold and are working to do the same even better.

Not sure what part of "it does homology modeling 2x better" you didn't see in my comment? AlphaFold scored something like 85% in CASP in 2020, in CASP 2016, I-TASSER had I think 42%? So it's ~2x as good as I-TASSER which is exactly what I said in my comment.

>This is not the first (or even tenth) time I’m seeing an academic trying to undermine genuine progress almost to the level of gaslighting. Comparing alphafold to conventional homology modeling is disingenuous at its most charitable interpretation.

It literally is homology modeling. The deep learning aspect is to boost otherwise unnoticed signal that most homology modeling software couldn't tease out. Also, I don't think I'm gaslighting, but maybe I'm wrong? If anything, I felt gaslit by the language around AlphaFold.

>Not sure what else to say. Structural biology has always been the weirdest field I’ve seen, the way students are abused (crystallize and publish in nature or go bust), and how every nature issue will have three structure papers as if that cures cancer every day. I suppose it warps one’s perception of outsiders after being in such a bubble?

What on earth are you even talking about? The vast, VAST majority of structures go unpublished ENTIRELY, let alone published in nature. There are almost 200,000 structures on deposit in the PDB.

replies(4): >>underd+6d >>dekhn+Od >>joe_th+D31 >>teawre+en2

>>flobos+B6
It's all about boosting signal by finding other proteins that are similar, until you get to the point that you can identify a fold to assign to a region of the protein. That's why some are structural, and some are not.

>Furthermore, AlphaFold can function with only a MSA as an input, without retrieving a single PDB coordinate.

Yes, it has a very nice model of what sequences should look like in 3D. That model is derived from experimental data. So if I give AlphaFold an MSA of a new, unknown protein fold (substantively away from any known fold), it cannot predict it.

replies(1): >>flobos+Ja

>>flobos+27
Perhaps you'd care to explain how? AlphaFold does not work on new folds. It ultimately relies on mapping sequence to structure. It does it better than anyone else, and in ways a human probably couldn't, but if you give it a brand new fold with no relation to other folds, it cannot predict it. I routinely areas of extremely low confidence many of my AlphaFold models. I work in organisms that have virtually 0 sequence identity. This is a problem I deal with every day. I wish AlphaFold worked in the way you are suggesting, but it just flat out does not.

replies(3): >>flobos+7c >>dekhn+Se >>johndf+aW

>>dekhn+x5
>I would like to correct somethign here- it does predict structures de novo and predict folds that haven't been seen before. That's because of the design of the NN- it uses sequence information to create structural constraints. If those constraints push the modeller in the direction of a novel fold, it will predict that.

Could you expand on this? Basically it looks at the data, and figures out what's an acceptable position in 3D space for residues to occupy, based on what's known about other structure?

I will update my original post to point out I may be not entirely correct there.

The distinction I'm trying to make is that there's a difference between looking at pre-existing data and modeling (ultimately homology modeling, but maybe slightly different) and understanding how protein folding works, being able to predict de novo how an amino acid sequence will become a 3D structure.

Also thank you for contacting CASP about this.

replies(2): >>dekhn+8b >>bawolf+hy1

>>COGlor+g8
> Yes, it has a very nice model of what sequences should look like in 3D.

A structural model, you would say.

> That model is derived from experimental data.

That doesn’t make it a template-based model, or a homology one.

> if I give AlphaFold an MSA of a new, unknown protein fold (substantively away from any known fold), it cannot predict it

That will depend on the number of effective sequences found to derive couplings. Domains with novel folds usually have a low number of remotely homolog sequences and for that reason the method will fail, not just because they are novel.

replies(1): >>COGlor+Ft

>>ramraj+F5
> Comparing alphafold to conventional homology modeling is disingenuous at its most charitable interpretation.

It's really not - have you played around with AF at all? Made mutations to protein structures and asked it to model them? Go look up the crystal structures for important proteins like FOXA1 [1], AR [2], EWSR1 [3], etc (i.e. pretty much any protein target we really care about and haven't previously solved) and tell me with a straight face that AF has "solved" protein folding - it's just a fancy language model that's pattern matching to things it's already seen solved before.

signed, someone with a PhD in biochemistry.

[1] https://alphafold.ebi.ac.uk/entry/P55317 [2] https://alphafold.ebi.ac.uk/entry/P10275 [3] https://alphafold.ebi.ac.uk/entry/Q01844

replies(1): >>ramraj+JG

>>COGlor+0a
From what I can tell, the model DM built is mining subtle relationships between aligned columns of multiple sequence alignments and any structural information which is tangibly related to those sequences. Those relationships can be used to infer rough atomic distances ("this atom should be within 3 and 7 angstroms of this other atom"). A large matrix (partially filled out) of distances is output, and those distances are used as constraints in a force field (which also includes lots of prior knowledge about protein structure) and then they run simulations which attempt to minimize both the force field and constraint terms.

In principle you don't even need a physical force field- if you have enough distance information between pairs of atoms, you can derive a plausible structure by embedding the distances in R3 (https://en.wikipedia.org/wiki/Distance_geometry and https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21...

Presumably, the signal they extract includes both rich local interactions (amino acids near in sequence) and distant ones inferred through sequence/structure relationships, and the constraints could in fact push a model towards a novel fold, presumably through some extremely subtle statistical relationships to other evolutionarily related proteins that adopt a different fold.

>>COGlor+n8
> It ultimately relies on mapping sequence to structure.

So does every structural prediction method.

> if you give it a brand new fold with no relation to other folds, it cannot predict it

That will depend on the number of effective sequences, not the actual fold.

> I work in organisms that have virtually 0 sequence identity.

Then the problem is low sequence coverage, not the protein fold. On a side note, there are sensitive homology search protocols that rely very little on actual sequence identity.

replies(1): >>bamboo+h61

>>COGlor+e7
> Not sure what part of "it does homology modeling 2x better" you didn't see in my comment? AlphaFold scored something like 85% in CASP in 2020, in CASP 2016, I-TASSER had I think 42%? So it's ~2x as good as I-TASSER which is exactly what I said in my comment.

Wait, stop, I don't know anything about proteins but 84% success is not ~2x better than 42%.

It doesn't really make sense to talk about 2x better in terms of success percentages, but if you want a feel, I would measure 1/error instead (a 99% correct system is 10 times better than a 90% correct system), making AlphaFold around 3.6 times better.

replies(1): >>palmtr+jq

>>COGlor+e7
What ramraj is talking about: if you go into a competitive grad program to get a PhD in structural biology, your advisor will probably expect that in 3-4 years you will: crystallize a protein of interest, collect enough data to make a model, and publish that model in a major journal. Many people in my program could not graduate until they had a Nature or Science paper (my advisor was not an asshole, I graduated with just a paper in Biochemistry).

In a sense both of you are right- DeepMind is massively overplaying the value of what they did, trying to expand its impact far beyond what they actually achieved (this is common in competitive biology), but what they did was such an improvement over the state of the art that it's considered a major accomplishment. It also achieved the target of CASP- which was to make predictions whose scores are indistinguishable from experimentally determined structures.

I don't think academics thought CASP was unwinnable but most groups were very surprised that an industrial player using 5 year old tech did so well.

replies(3): >>flobos+zf >>lucidr+eh >>valara+cl

>>COGlor+n8
No organisms have virtually 0 sequence identity. That's nonsense. Can you give an example? n Even some random million-year-isolated archae shares the majority of its genes with common bacteria.

replies(2): >>biomcg+cy >>COGlor+Ek1

>>dekhn+Od
Hear, hear. This is probably the best take.

>>COGlor+(OP)
Yup. It’s great, but there are still many aspects to unpack and work on. Hence why Rosetta is a thing.

replies(1): >>flobos+ti

>>dekhn+Od
To add to this, the deep learning field has already moved on towards MSA-less structure prediction. None of this would be possible without building on top of the work open sourced by Deepmind.

https://www.biorxiv.org/content/10.1101/2022.07.21.500999v1 https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1

To be overly dismissive is to lack imagination.

replies(1): >>tomp+Rk1

>>zack-m+If
Rosetta methods are also moving towards ML. Here’s an article from last week: https://www.science.org/doi/10.1126/science.abn2100

>>dekhn+Od
> What ramraj is talking about: if you go into a competitive grad program to get a PhD in structural biology, your advisor will probably expect that in 3-4 years you will: crystallize a protein of interest, collect enough data to make a model, and publish that model in a major journal.

All of that is generally applicable to molecular biology in general, and I don't see how the field of structural biology is especially egregious, the way ramraj is making it out to be.

replies(2): >>flobos+7p >>dekhn+yq

>>valara+cl
Protein crystallization can be very difficult and there is no general solution. Kits that screen for crystal growth conditions usually help but optimization is needed in most cases. Then, that crystal must have certain properties that allow for good data acquisition at the X-ray facility. That’s another problem by itself and months or years can pass until you get a suitable protein crystal and X-ray diffraction dataset where you can model your structure.

replies(1): >>valara+Pr

>>underd+6d
I think odds ratio ( p/(1-p) ) is the thing I'd use here. It gives the right limiting behavior (at p ~= 0, doubling p is twice as good, and at p~=1, halving 1-p is twice as good) and it's the natural way to express Bayes rule, meaning you can say "I'm twice as sure (in odds ratio terms) based on this evidence" and have that be solely a property of the update, not the prior.

replies(3): >>camjw+4h1 >>bscphi+nD3 >>underd+2G7

>>valara+cl
I did rotations in multiple types of lab as part of my program and I can't say I ever found that students in regular molecular biology labs had nearly as hard a time as structural biologists; SB is its own class of hell. Given the number of papers published in molecular biology that turn out to be "gel was physically cut and reasssembled to show the results the authors desired" (it's much harder to cheat on a protein structure)...

replies(1): >>valara+gE

>>flobos+7p
I'm familiar with protein crystallization and the difficulties associated with it. What I don't agree with is the characterization of the field as especially difficult, above and beyond modern biology in general. Nor can I support the assertion that structural biology students are subject to special abuse that regular grad students are not.

> ... can be very difficult and there is no general solution

This is true of pretty much any graduate work in molecular biology.

replies(2): >>flobos+Eu >>ramraj+SJ

>>flobos+Ja
>Domains with novel folds usually have a low number of remotely homolog sequences and for that reason the method will fail, not just because they are novel.

How can you say this but not believe it's doing homology modeling?

replies(1): >>flobos+Px

>>valara+Pr
> Nor can I support the assertion that structural biology students are subject to special abuse that regular grad students are not.

I didn’t say anything regarding that.

> This is true of pretty much any graduate work in molecular biology.

Just to elaborate my point: The process of protein cristallization is not understood at a level that allows the design of general and reproducible protocols. This inherent obscurity means that every new protein needs to undergo an ad hoc, heuristic, iterative process to obtain high quality crystals. This is an early methodological hurdle, at a stage where other routine procedures in biochemistry or molecular biology are usually successful.

replies(2): >>dekhn+nz >>valara+1B

>>COGlor+Ft
Because homology search is not homology modelling. And a multiple sequence alignment is not a structural (i.e, with three-dimensional coordinates) template.

replies(1): >>ssivar+QM

>>dekhn+Se
Organisms, yes. Individual genes within an organism may have no sequence identity to genes in other organisms (outside of what you would expect at random). See: https://en.wikipedia.org/wiki/Orphan_gene

replies(1): >>dekhn+az

>>COGlor+(OP)
“Disclaim” stopped me.

Disclaim means to deny or renounce.

replies(2): >>Uehrek+KO >>Andory+kK1

>>biomcg+cy
Yes, that's what I thought. I worked with m. genitalium and we were always looking for proteins that had no homology or no existing structure (https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20...)

>>flobos+Eu
I said that. We had a saying in grad school, "the very best protein structures are crystallized from postdoc tears".

replies(1): >>valara+QB

>>flobos+Eu
> I didn’t say anything regarding that.

I know you didn't - this was one of the claims of ramraj I was responding to.

> The process of protein cristallization is not understood at a level that allows the design of general and reproducible protocols. This inherent obscurity means that every new protein needs to undergo an ad hoc, heuristic, iterative process to obtain high quality crystals. This is an early methodological hurdle, at a stage where other routine procedures in biochemistry or molecular biology are usually successful.

I don't disagree, though I would suggest that there's just as much grunt work, frustration, and hand wringing in other fields of molecular biology at the graduate level and above. Even if other fields have reproducible protocols established, that's not what gets papers published. With the possible exception of clinical samples, more often than not we have no clue if the analyses we're doing will yield anything, and the high risk zone is where all grad students live.

>>dekhn+nz
As a current postdoc (genetics) I think postdoc tears are the fuel that academia runs on - as well as those of our significant others and kids.

>>dekhn+yq
I think this is highly subjective and that every field has its own special hells. For example, in computational biology it's a lot easier to generate results (when things actually work) but conversely it's a lot harder to convince journals. The burden of proof required to publish is sometimes ridiculously high - I had a paper spend almost 3 years in review.

>>shpong+3b
I can see the loops in these structures. I dont see the problem. It still added a structure to every embl page, and people are free to judge the predictions themselves. For all I care (ostensibly as the end customer of these structures) I don’t mind having a low confidence structure for any arbitrary protein at all. It’s only marginally less useful to actual biology than full on X-ray structures anyway.

replies(1): >>entee+VN1

>>COGlor+(OP)
These threads are always the same: lots of comments about protein folding, how amazing DeepMind is, how AlphaFold is a success story, how it has flipped an entire field on it's head, etc.

I don't think that's necessarily so - there is a lot of justified scepticism about the wilder claims of ML in this forum; it is in fact quite difficult at times to know as an outsider to the field in question how kneejerk it is.

>>valara+Pr
In most other sub fields you don’t get to not publish if exactly one endpoint never comes to pass. I know I didn’t have something like that, and most of my non crystallographer friends didn’t.

There’s a lot of structural biology apologists here in this thread. Happy to crap on DeepMind but not ready to take criticism of their own field.

For anyone outside of the field wanting to learn more, check out this documentary: https://en.m.wikipedia.org/wiki/Naturally_Obsessed

replies(1): >>valara+VU

>>COGlor+(OP)
Additionally, folding doesn't focus on what matters. Generally you want to understand the active site, you already know the context (globular, membrane, embedded, conjugated) of the protein. It is interesting whether the folding could help identify active sites for further analysis. But -- I don't think alphago is identifying new active sites or improving our understanding of their nuances.

>>flobos+Px
For someone who knows very little about this field, could you elaborate on what specific aspect of “homology modeling” AF violates/circumvents which makes you call it “homology search” instead?

replies(1): >>flobos+RS

>>COGlor+(OP)
Right, but even a speed up / quality increase can flip workflows on their head. Take ray tracing for example, when you speed it up by an order of magnitude, you can suddenly go from taking a break every time you want to render a scene, vs being able to iteratively work on a scene and preview it as you work.

>>wrycod+Ay
Can we just chill on the whole “using this single word incorrectly breaks your whole argument” thing?

A lot of folks on HN end posts about a company with a sentence like “Disclaimer: I used to work for X”. This language (probably taken from contract law or something) is meant an admission of possible bias but in practice is also a signal that this person may know what they’re talking about more-so than the average person. After reading a lot of posts like this, it might feel reasonable for someone to flip the word around say something like “I need to disclaim…” when beginning a post, in order to signal their proximity to a topic or field as well as any sort of insider bias they may possess.

So sure, “I need to disclose” would’ve been the better word choice, but we all knew what GP was saying. It seems pedantic to imply otherwise.

replies(2): >>alasda+lR >>wrycod+KO1

>>COGlor+(OP)
I got a lot of shit (still do) when the news first broke for pushing back against the notion that AlphaFold "solved" protein folding. People really wanted to attach that word to the achievement. Thank you for providing a nuanced take on exactly why that doesn't make any sense.

>>Uehrek+KO
>we all knew what GP was saying

I was confused initially too.

>>ssivar+QM
Homology search is a method to find homologous sequences, that is, evolutionary related sequences that posess a common ancestor. This was usually done based on how identical sequences were, but newer algorithms allow to find remote homologs even when the identity between the sequences is very low. The first step in AlphaFold is to retrieve as many remotely homolog sequences as possible to generate a multiple sequence alignment (MSA) that will be used to generate the embedding.

On the other hand, homology (or comparative) modelling is a method that generates a structural model of a query sequence based on one or more experimentally solved structure of a close protein homolog. The model generation details depend on the specific protocol but, broadly speaking, spatial restraints are extracted from the template structures and mapped to the query sequence to be modelled.

Note that AlphaFold also uses a type of geometrical restraint (pairwise residue distances) in its modelling, although they are not derived from protein structures but the MSA embeddings. Both are related but are not exactly the same.

One difference between AlphaFold and homology modelling is that the latter requires templates having a certain sequence identity with the query sequence (≥30% is the rule of thumb), while the former can have in its MSA remotely homolog sequences well below any discernible identity.

>>ramraj+SJ
> In most other sub fields you don’t get to not publish if exactly one endpoint never comes to pass. I know I didn’t have something like that, and most of my non crystallographer friends didn’t.

How is this a problem unique to structural biology? In every subfield we're hoping to publish interesting results, and that endpoint is defined by the nature of the field. As a geneticist, in the early 90s, sequencing & characterizing a single bacterial gene would have been the focus of an ambitious PhD thesis and would yield multiple papers. Sequencing at that time period had a dozen points of failure and high risk to set as the goal for a thesis. Today, sequencing a whole genome is unlikely to yield a single publication. If you're setting the ability to crystallize as the single point of failure endpoint, that logic applies to every subfield. We all have something that could potentially derail our plans, and I fail to see how structural biology is unique in that respect.

> There’s a lot of structural biology apologists here in this thread. Happy to crap on DeepMind but not ready to take criticism of their own field.

I'm not a structural biologist - I'm a Geneticist who disagrees with your characterization of SB. The issues you've mentioned are not unique to SB, but apply to pretty much all subfields. I see grad students in general lament their life choices when their cell culture fails, their mice die, protocols just don't work, or their results just don't make sense.

replies(1): >>flobos+S21

>>COGlor+n8
There's hype and then there's anti-hype hype, which tries to undermine any genuine progress in a hip contrarian fashion. Eg look, I'm the only who can see the truth. There's AI hype and then there's anti-AI Gary Marcus hype, who never produces any novel criticism. It's the same banal broken record every single time put in a very self-aggrandizing manner.

DM is probably hyping it up and you are most likely hyping up your own criticism. It's a great symbiotic relationship outwardly presented as opposition.

>>ramraj+F5
Not only is CASP not "unwinnable," it's not even a contest. The criteria involved are rated as "moderately difficult." Alphafold is a significant achievement but it sure as hell hasn't "revealed the structure of the protein universe," whatever that means.

Which top labs have changed direction? Because Alphafold can't predict folds, just identify ones it's seen.

>>valara+VU
> If you're setting the ability to crystallize as the single point of failure endpoint, that logic applies to every subfield.

I agree that there are other fields with similar issues. What baffles me is how long protein crystallization has been a problem.

I’ll use your example:

Nowadays, sequencing a gene is unlikely to yield a single publication by itself but is no early point of failure. It’s a solved problem with protocols that have been thoroughly developed and explained to boredom. New early points of failure arise (sample related, maybe?).

Nowadays, determining the structure of a protein is unlikely to yield a single publication by itself but has a clear, early, unsolved point of failure. No understandable protocol other than buying $creening plate$, fetching cat whiskers, drawing a theoretical phase diagram that tells you nothing, and pray that your crystallization tray doesn’t show a scrambled egg tomorrow or in six weeks. This has been an issue for more than fifty years and almost 200k published structures. The jump you mentioned in sequencing hasn’t happened yet in protein crystallography and might never happen because our understanding of macromolecular crystallization is lacking and thus we cannot predict proper crystallization conditions.

replies(1): >>valara+n61

>>COGlor+(OP)
I'm curious to read more on the 3-4 other deep learning projects you mentioned that have had a larger impact on your fields. Can you share some links to those works?

>>COGlor+e7
I think the debate between "does amazing on metric X" versus "doesn't really understand the problem" reappears many places and doesn't have any direct way to be resolved.

That's more or less because "really understands the problem" generally winds-up being a placeholder for things the system can't. Which isn't to say it's not important. One thing that is often included in "understanding" is the system knowing the limits of its approach - current AI systems have a harder time giving a certainty value than giving a prediction. But you could have a system that satisfied a metric for this and other things would pop up - for example, what kind of certainty or uncertainty are we talking about (crucial for decision making under uncertainty).

>>flobos+7c
So then based on your counter arguments to the OP, have they mapped the entire protein universe ? Or should it say, the “already known protein universe” ?

replies(1): >>flobos+bh1

>>flobos+S21
Sure, I agree that crystallization in particular has faced this particular bottleneck for a long time. The field of SB, however, has still managed to advance massively too. For example, Cryo-EM can do things we could barely imagine a decade ago.

The point I'm trying to make is that from the perspective of a grad student, no field is devoid of risk, and it's surprisingly easy to be stuck by something that's a solved problem on paper. For example, I know of a grad student that's been trying to develop a mouse line for about a year now, and has now discovered that this strain just won't work for what they have in mind - and must now recreate the mutant combinations in a different strain that's at least a year's work - if it even works. I've heard stories of entire mouse lines die, and you're back to square one - years of work lost.

The other thing that complicates some of these fields is the massive pace of innovation they're undergoing that it is very hard for an individual lab to keep up to date. Grad students are using techniques that were published less than 5 years ago, and there's no locally available expertise to tap into. What remains the same is the level of grunt work grad students and postdocs have to do, even if the techniques get more sophisticated over time.

>>COGlor+(OP)
I asked a structural biologist friend of mine (world class lab) about the impact of alphafold.

They said it's minimal.

In most cases, having a "probably" isn't good enough. They use alphafold to get early insights, but then they still use crystallography to confirm the structure. Because at the end of the day, you need to know for sure.

replies(1): >>bawolf+Ly1

>>palmtr+jq
For the lazy, this would make alphafold 7.25x better than the previous tools

>>bamboo+h61
Neither the protein sequence nor structure spaces have been fully explored, and the sequence set of UniProt does not represent every single extant protein. My answer is “no”.

>>dekhn+Se
MKVLMKKESLPIVKPFDEVIIEVLQAPKEVEREVALKDGTIKKIQDYSIIVKPVSGKFESVTEKVTSKTEDGDEVVKPKKYDASELKDKVVMKLTQKAFEVLYDAWQNKEIGEGTKLKIKVTKKQNKTYFDEITVLDEKEEEETEEEAKVKPKPKLKG

replies(1): >>dekhn+an1

>>lucidr+eh
How do we know these "MSA-less" models aren't cheating (i.e. learning all MSAs implicitly from their training data)? If they are, they would similarly fail on any "novel" AA sequence (i.e. one without known/learned MSAs)

>>COGlor+Ek1
That's a single protein not an organism.

replies(1): >>andrew+mJ1

>>ramraj+F5
This isn’t a good use of the term gaslighting. Accusing someone of gaslighting takes what we used to call a ‘difference of opinion’ and mutates it into deliberate and wicked psychological warfare.

Incidentally, accusing someone of gaslighting is itself a form of gaslighting.

replies(1): >>tomjak+1v1

>>Gatsky+lp1
Well, it can be gaslighting but not always. A knowingly false accusation, repeated often enough and in a way to make the accused question their own perception of reality, would be gaslighting.

>>COGlor+0a
> The distinction I'm trying to make is that there's a difference between looking at pre-existing data and modeling (ultimately homology modeling, but maybe slightly different) and understanding how protein folding works, being able to predict de novo how an amino acid sequence will become a 3D structure.

Your objection is that alphafold is a chinese room?

What does that matter? Either it generates useful results or it doesn't. That is the metric we should evaluate it on.

replies(1): >>COGlor+5O1

>>simsla+O71
I'm not a biologist, but that doesn't sound minimal if crystallography is expensive.

It sounds like how we model airplanes in computers, but still test the real thing - i wouldn't call the impact of computer modelling on airplane design to be minimal.

>>COGlor+(OP)
as an outsider learning more about protein folding, could you elaborate on the assertion that the sequence is (mostly) all you need (transformer/ML reference intended).

doesn't this assume the final fold is static and invariant of environmental and protein interactions?

put another way, how do we know that a protein does not fold differently under different environmental conditions or with different molecular interactions?

i realize this is a long-held assumption, but after studying scientific research for the past year, i realize many long-held assumptions aren't supported by convincing evidence.

>>dekhn+an1
They obviously mean organisms that have notable numbers of proteins with virtually no sequence identity. The difference is only germane to the conversation if you're looking for something to nitpick. The only point of bringing it up was that they encounter non-trivial numbers of really weird proteins.

>>wrycod+Ay
Or to make a disclaimer.. like the OP post did?

Merriam webster[1]: " Definition of disclaim

intransitive verb 1 : to make a disclaimer ... "

[1]: https://www.merriam-webster.com/dictionary/disclaim

replies(1): >>wrycod+oP1

>>ramraj+JG
> It’s only marginally less useful to actual biology than full on X-ray structures anyway.

I'm not sure what you're implying here. Are you saying both types of structures are useful, but not as useful as the hype suggests, or that an X-Ray Crystal (XRC) and low confidence structures are both very useful with the XRC being marginally more so?

An XRC structure is great, but it's a very (very) long way from getting me to a drug. Observe the long history of fully crystalized proteins still lacking a good drug. Or this piece on the general failure of purely structure guided efforts in drug discovery for COVID (https://www.science.org/content/blog-post/virtual-screening-...). I think this tech will certainly be helpful, but for most problems I don't see it being better than a slightly-more-than-marginal gain in our ability to find medicines.

Edit: To clarify, if the current state of the field is "given a well understood structure, I often still can't find a good medicine without doing a ton of screening experiments" then it's hard to see how much this helps us. I can also see several ways in which a less than accurate structure could be very misleading.

FWIW I can see a few ways in which it could be very useful for hypothesis generation too, but we're still talking pretty early stage basic science work with lots of caveats.

Source: PhD Biochemist and CEO of a biotech.

>>bawolf+hy1
Because it's being presented as something that it isn't. It's a better way to analyze data that we got experimentally, and to predict how new data will fit into what we know. It's not de novo understanding, which is the holy grail and what the field is ultimately trying to accomplish. It's Tesla's adaptive cruise control being sold as full self driving. Yes, they are close things - one is an approximation of the other, but being really really good at adaptive cruise control has basically zero carryover to full self driving. FSD isn't a linear progression from adaptive cruise control, and understanding how proteins fold isn't a linear progression from AlphaFold sequence homology/homology modeling. It's not even close to the same thing, AlphaFold doesn't even move the needle for our understanding of how proteins fold, and yet it's sucking all the air out of the conversation by presenting itself like it solved this problem.

It's a really good, fancy model completely reliant on data we already have empirically (and therefore subject to all the same biases as well).

replies(1): >>bawolf+PY1

>>Uehrek+KO
Let me translate. They said, “I will disclaim I am a professional structural biologist that works in this field every day.”

That is synonymous with saying, “I will deny I am a professional structural biologist that works in this field every day.”

The person posting is actually a structural biologist. What they stated was cognitively dissonant with the intent of their post, and that’s what stopped me.

I don’t pay attention to typos or minor usage issues, but in this case, I read two more sentences and said, “What??”

EDIT: Two more things. First, I found the post interesting and useful. I didn’t say anything about breaking the argument.

Second, “I need to disclose…” is the exact opposite of what they said.

replies(1): >>Totemp+Oyd

>>Andory+kK1
The verb was used transitively.

Transitive verb:

2 : DENY, DISAVOW disclaimed any knowledge of the contents of the letter

>>COGlor+5O1
I'm assuming "de novo" means from first principles?

i really don't think anyone is presenting alphafold as if its a physics simulator operating from first principles.

Like obviously alphafold does not "understand". Maybe i have blinders on for being in the computer field, but i would assume that it goes without saying that a statistical deep learning AI model does not tell us how to solve the problem from first principles.

Like yes, alphafold isn't the final chapter in protein folding and that is obvious. But it seems a stretch to dismiss it on those grounds. If that's the metric we're going with then we can dismiss pretty much everything that has happened in science for the past thousand years.

> re self driving car metaphor

I think this is a bad metaphor for your purposes, because self-driving cars aren't de novo understanding, and arguably do have some carry over from things like adaptive cruise control.

>>COGlor+e7
> AlphaFold scored something like 85% in CASP in 2020, in CASP 2016, I-TASSER had I think 42%? So it's ~2x as good as I-TASSER

As someone who doesn't know proteins, but is decent at math, I would not describe it this way. You are assuming a linear relationship between effort and value, but more often than not, effort has diminishing returns. 80dB is not 2x as loud as 40 dB. An 8K image doesn't have 2x the fidelity of a 4K image. If Toyota unveiled a new engine that was 60% efficient tomorrow, no one in their right mind would say "eh, it's just 2x better". If we came out with a CPU that could clock up to 10Ghz we wouldn't say "meh, that's just 2x what we had".

Without being able to define the relationship here, I could just as well say that 85% is 1000x better than 42%. There's just no way to put a number on it. What we can say is that we completely blew all projections out of the water.

Again, I'm not someone working with proteins, but to me it sounds as revolutionary as a 60%+ efficient engine, or a 10Ghz CPU. No one saw it coming or thought it feasible with current technology.

>>palmtr+jq
Excellent comment. I think the issue is that "better" is underspecified and needs some precisification to be useful. The metric you are using here is the proper response to the question "how many times more surprising is it when method A fails than method B?". This is in many cases what we care about. Probably, it's what we care about here. The odds ratio seems to do a good job of capturing the scale of the achievement.

On the other hand, it's not necessarily the only thing we might care about under that description. If I have a manufacturing process that is 99.99% successful (the remaining 0.01% has to be thrown out), it probably does not strike me as a 10x improvement if the process is improved to 99.999% success. What I care about is the cost to produce the average product that can be sent to market, and this "10x improvement" changes that only a very small amount.

>>palmtr+jq
TIL, thanks for this.

>>wrycod+KO1
Looks like the downvotes say that your interpretation of this language in this context is not the most common interpretation.

>>COGlor+(OP)
Disclaimer: I'm a professional (computational) structural biologist. My opinion is slightly different.

The problem with the structure prediction problem is not a loss/energy function problem, even if we had an accurate model of all the forces involved we'd still not have an accurate protein structure prediction algorithm.

Protein folding is a chaotic process (similar to the 3 body problem). There's an enormous number of interactions involved - between different amino acids, solvent and more. Numerical computation can't solve chaotic systems because floating point numbers have a finite representation, which leads to rounding errors and loss of accuracy.

Besides, Short range electro static and van der waals interactions are pretty well understood and before alphafold many algorithms (like Rosetta) were pretty successful in a lot of protein modeling tasks.

Therefore, we need a *practical* way to look at protein structure determination that is akin to AlphaFold2.