AlphaFold reveals the structure of the protein universe

>>biffta+q
This video goes some way to explaining how they know the structures are correct: https://www.youtube.com/watch?v=vXZzftX03VY

>>luma+T6
As a simple example, one measure used to compare a predicted structure against a reference is the RMSD (root mean square deviation).

https://en.m.wikipedia.org/wiki/Root-mean-square_deviation_o...

The lower the RMSD between two structures, the better (up to some limit).

>>f38zf5+hb
I recognize your superior knowledge in the topic and assume you're right.

But you also ignore where we're at in the standard cycle:

https://phdcomics.com/comics/archive_print.php?comicid=1174

;)

>>f38zf5+hb
edit: I should have read the post first! What do you mean 'only globular proteins'? They say they have predictions for all of UniProt...

---------------

Yes, the idea of a 'protein universe' seems like it should at least encompass 'fold space'.

For example, WR Taylor : https://pubmed.ncbi.nlm.nih.gov/11948354/

I think the rough estimate was that there were around 1000 folds - depending on how fine-grained you want to go.

Absolutely agree, though, that a lot of proteins are hard to crystalise (i understand) due to being trans-membrane or just the difficulty of getting the right parameters for the experiment.

>>dalbas+l7
https://en.m.wikipedia.org/wiki/Protein_folding

>>dalbas+l7
This might be an interesting resource for you: https://pdb101.rcsb.org/

>>t00+fg
the groundworks, at least partially, happen as you typed this: https://www.nature.com/articles/d41586-021-01627-2

>>f38zf5+ge
> but other famous labs have already moved to ML predictions and are competitive with DeepMind now.

it seems obvious this was going to happen, because https://github.com/deepmind/alphafold

>>MindGo+(OP)
Obtaining this dataset prior to alphafold would have cost on the order of $200 trillion. https://twitter.com/wintonARK/status/1552653527670857729

Anyone knowledgeable know if this estimate is accurate? Insane if true

>>lamena+jf
Most of them are not, just estimations based on previous results given sequences with known structure.

Every couple years there is a massive competition called CASP where labs submit previously unresolved protein structures derived from experimental EM, x-ray crystallography, or NMR studies and other labs attempt to predict these structures using their software. AlphaFold2 absolutely destroyed the other labs in the main contest (regular monomeric targets, predominantly globular) for structure resolution two years ago, in CASP 14.

https://predictioncenter.org/casp14/zscores_final.cgi

The latest contest, CASP15, is currently underway and expected to end this year. As with all ML, the usual caveats apply to the models Google generated -- the dangers of overfitting to existing structures, artifacts based on the way the problem was modelled, etc

>>f38zf5+hb
People have been making grand statements about the structure of the protein universe for quite some time (I've seen a fair number of papers on this, such as https://faseb.onlinelibrary.wiley.com/doi/abs/10.1096/fasebj... and https://faseb.onlinelibrary.wiley.com/doi/abs/10.1096/fasebj... from a previous collaborator of mine).

Google didn't solve the structure of the protein universe (thank you for saying that). But the idea of the protein structure universe is fairly simple- it's a latent space that allows for direct movement over what is presumably the rules of protein structures along orthogonal directions. It would encompass all the "rules" in a fairly compact and elegant way. Presumably, superfamilies would automagically cluster in this space, and proteins in different superfamilies would not.

>>klemol+Pm
https://alphafold.ebi.ac.uk/

>>jebark+Vo
Yes, very much so. Even for proteins that seems like they are just scaffolding for a catalytic centre can have important dynamics.

A classic example is haemoglobin, that 'just' binds to oxygen at the iron in the middle of the haem. Other binding sites remote from the oxygen binding one can bind to other molecules - notably carbon dioxide. The 'Bohr effect' mechanism is outlined here : https://en.wikipedia.org/wiki/Bohr_effect#Allosteric_interac...

Even at the lowest level, there is some evidence that ligand binding can affect the structure of the backbone of the protein. For example, peptide plane flipping https://en.wikipedia.org/wiki/Peptide_plane_flipping although I'm not sure where the research is on this nowadays.

>>jarenm+fb
This is definitely one of the most exciting spaces in AI right now. Another somewhat-related startup is PostEra (medicinal chemistry for drug discovery via AI) https://postera.ai/about/

>>MindGo+(OP)
> Today, I’m incredibly excited to share the next stage of this journey. In partnership with EMBL’s European Bioinformatics Institute (EMBL-EBI), we’re now releasing predicted structures for nearly all catalogued proteins known to science, which will expand the AlphaFold DB by over 200x - from nearly 1 million structures to over 200 million structures - with the potential to dramatically increase our understanding of biology.

And later:

> Today’s update means that most pages on the main protein database UniProt will come with a predicted structure. All 200+ million structures will also be available for bulk download via Google Cloud Public Datasets, making AlphaFold even more accessible to scientists around the world.

This is the actual announcement.

UniProt is a large database of protein structure and function. The inclusion of the predicted structures alongside the experimental data makes it easier to include the predictions in workflows already set up to work with the other experimental and computed properties.

It's not completely clear from the article whether any of the 200+ million predicted structures deposited to UniProt have not be previously released.

Protein structure determines function. Before AlphaFold, experimental structure determination was the only option, and that's very costly. AlphaFold's predictions appears to be good enough to jumpstart investigations without an experimental structure determination. That has the potential to accelerate many areas of science and could percolate up to therapeutics.

One area that doesn't get much discussion in the press is the difference between solid state structure and solution state structure. It's possible to obtain a solid state structure determination (x-ray) that has nothing to do with actual behavior in solution. Given that AlhpaFold was trained to a large extent on solid state structures, it could be propagating that bias into its predicted structures.

This paper talks about that:

> In the recent Critical Assessment of Structure Prediction (CASP) competition, AlphaFold2 performed outstandingly. Its worst predictions were for nuclear magnetic resonance (NMR) structures, which has two alternative explanations: either the NMR structures were poor, implying that Alpha-Fold may be more accurate than NMR, or there is a genuine difference between crystal and solution structures. Here, we use the program Accuracy of NMR Structures Using RCI and Rigidity (ANSURR), which measures the accuracy of solution structures, and show that one of the NMR structures was indeed poor. We then compare Alpha-Fold predictions to NMR structures and show that Alpha-Fold tends to be more accurate than NMR ensembles. There are, however, some cases where the NMR ensembles are more accurate. These tend to be dynamic structures, where Alpha-Fold had low confidence. We suggest that Alpha-Fold could be used as the model for NMR-structure refinements and that Alpha-Fold structures validated by ANSURR may require no further refinement.

https://pubmed.ncbi.nlm.nih.gov/35537451/

>>jcranm+hp
That is only true because of our current tools and capabilities. With improved manufacturing techniques and AlphaFold++ I think biologics will dominate. Even still, there are ~2000 approved biologics [0].

[0] - https://purplebooksearch.fda.gov/advanced-search

>>crispy+Ww
The ribbons and helices you see in those pictures are abstract representations of the underlying positions of specific arrangements of carbon atoms along the backbone.

There are tools such as DSSP https://en.wikipedia.org/wiki/DSSP_(hydrogen_bond_estimation... which will take out the 3d structure determined by crystallography and spit out hte ribbons and helices- for example, for helices, you can see a specific arrangement of carbons along the protein's backbone in 3d space (each carbon interacts with a carbon 4 amino acids down the chain).

Protein motion at room temperature varies depending on the protein- some proteins are rocks that stay pretty much in the same single conformation forever once they fold, while others do thrash around wildly and others undergo complex, whole-structure rearrangements that almost seem magical if you try to think about them using normal physics/mechanical rules.

Having a magical machine that could output the full manifold of a protein during the folding process at subatomic resolution would be really nice! but there would be a lot of data to process.

>>crispy+Ww
In addition to /u/dekhn 's excellent description, this phenomenon is referred to as a protein's "secondary structure" [0]

[0] https://en.m.wikipedia.org/wiki/Protein_secondary_structure

>>crispy+Ww
Short answer is that the ribbon representation is a visual simplification based on known structures -- they are actually composed of atoms.

They certainly do "thrash around", but that thrashing is constrained by the bonds that are formed, which greatly limits the degrees of freedom. Here's a short video of a simulation to demonstrate:

https://www.youtube.com/watch?v=fggqPtaZj8g

>>lamena+jf
For a lot of X-ray crystallography cases, some of the difficulty is working out with no prior information, the actual structure from the collected data. This makes a lot of that... much easier because with https://en.wikipedia.org/wiki/Molecular_replacement something that is "close, but not correct" can be used to bootstrap the actual structure from.

>>MindGo+(OP)
Today I learned that there are bacteria that have a protein helping to form ice on plants [1] to destroy them and extract nutrients (however I didn't understand how bacteria themselves survive this).

Machine learning typically uses existing data to predict new data. Please explain: Does it mean that AlphaFold can only use known types of interactions between atoms and will mispredict the structure of proteins that use not yet known interactions?

And why we cannot just simulate protein behaviour and interactions using quantum mechanics?

[1] https://pubs.acs.org/doi/10.1021/acs.jpcb.1c09342

>>MindGo+(OP)
Before my comment gets dismissed, I will disclaim I am a professional structural biologist that works in this field every day.

These threads are always the same: lots of comments about protein folding, how amazing DeepMind is, how AlphaFold is a success story, how it has flipped an entire field on it's head, etc. The language from Google is so deceptive about what they've actually done, I think it's actually intentionally disingenuous.

At the end of the day, AlphaFold is amazing homology modeling. I love it, I think it's an awesome application of machine learning, and I use it frequently. But it's doing the same thing we've been doing for 2 decades: pattern matching sequences of proteins with unknown structure to sequences of proteins with known structure, and about 2x as well as we used to be able to.

That's extremely useful, but it's not knowledge of protein folding. It can't predict a fold de novo, it can't predict folds that haven't been seen (EDIT: this is maybe not strictly true, depending on how you slice it), it fails in a number of edge cases (remember, in biology, edge cases are everything) and again, I can't stress this enough, we have no new information on how proteins fold. We know all the information (most of at least) for a proteins final fold is in the sequence. But we don't know much about the in-between.

I like AlphaFold, it's convenient and I use it (although for anything serious or anything interacting with anything else, I still need a real structure), but I feel as though it has been intentionally and deceptively oversold. There are 3-4 other deep learning projects I think have had a much greater impact on my field.

EDIT: See below: https://news.ycombinator.com/item?id=32265662 for information on predicting new folds.

>>crispy+Ww
> I recall from school the equipartition theorem-- 1/2 KT of kinetic energy for each degree of freedom. These things obviously have many degrees of freedom. So wouldn't they be "thrashing around" like rag doll in a blender at room temperature?

It's funny you say that, because the first image on the English Wikipedia page for Equipartition Theorem[1] is an animation of the thermal motion of a peptide.

[1]: https://en.wikipedia.org/wiki/Equipartition_theorem

>>MindGo+(OP)
Now we can start guessing what futures they are betting on: these, in which open-sourcing the whole thing commoditises critical complements.

---

https://www.gwern.net/Complement

>>mupuff+HE
1) Isonet - takes low SNR cryo-electron tomography images (that are extremely dose limited, so just incredibly blurry and frequently useless) and does two things:

* Deconvolutes some image aberrations and "de-noises" the images

* Compensates for missing wedge artifacts (missing wedge is the fact that the tomography isn't done -90° --> +90°, but usually instead -60° --> +60°, leaving a 30° wedge on the top and bottom of basically no information) which usually are some sort of directionality in image density. So if you have a sphere, the top and bottom will be extremely noisy and stretched up and down (in Z).

https://www.biorxiv.org/content/10.1101/2021.07.17.452128v1

2) Topaz, but topaz really counts as 2 or 3 different algorithms. Topaz has denoising of tomograms and of flat micrographs (i.e. images taken with a microscope, as opposed to 3D tomogram volumes). That denoising is helpful because it increases contrast (which is the fundamental problem in Cryo-EM for looking at biomolecules). Topaz also has a deep learning particle picker which is good at finding views of your protein that are under-represented, or otherwise missing, which again, normally results in artifacts when you build your 3D structure.

https://emgweb.nysbc.org/topaz.html

3) EMAN2 convolutional neural network for tomogram segmentation/Amira CNN for segmentation/flavor of the week CNN for tomogram segmentation. Basically, we can get a 3D volume of a cell or virus or whatever, but then they are noisy. To do anything worthwhile with it, even after denoising, we have to say "this is cell membrane, this is virus, this is nucleic acid" etc. CNNs have proven to be substantially better at doing this (provided you have an adequate "ground truth") than most users.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5623144/

>>ramraj+oJ
> Comparing alphafold to conventional homology modeling is disingenuous at its most charitable interpretation.

It's really not - have you played around with AF at all? Made mutations to protein structures and asked it to model them? Go look up the crystal structures for important proteins like FOXA1 [1], AR [2], EWSR1 [3], etc (i.e. pretty much any protein target we really care about and haven't previously solved) and tell me with a straight face that AF has "solved" protein folding - it's just a fancy language model that's pattern matching to things it's already seen solved before.

signed, someone with a PhD in biochemistry.

[1] https://alphafold.ebi.ac.uk/entry/P55317 [2] https://alphafold.ebi.ac.uk/entry/P10275 [3] https://alphafold.ebi.ac.uk/entry/Q01844

>>COGlor+JN
From what I can tell, the model DM built is mining subtle relationships between aligned columns of multiple sequence alignments and any structural information which is tangibly related to those sequences. Those relationships can be used to infer rough atomic distances ("this atom should be within 3 and 7 angstroms of this other atom"). A large matrix (partially filled out) of distances is output, and those distances are used as constraints in a force field (which also includes lots of prior knowledge about protein structure) and then they run simulations which attempt to minimize both the force field and constraint terms.

In principle you don't even need a physical force field- if you have enough distance information between pairs of atoms, you can derive a plausible structure by embedding the distances in R3 (https://en.wikipedia.org/wiki/Distance_geometry and https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21...

Presumably, the signal they extract includes both rich local interactions (amino acids near in sequence) and distant ones inferred through sequence/structure relationships, and the constraints could in fact push a model towards a novel fold, presumably through some extremely subtle statistical relationships to other evolutionarily related proteins that adopt a different fold.

>>dekhn+xR
To add to this, the deep learning field has already moved on towards MSA-less structure prediction. None of this would be possible without building on top of the work open sourced by Deepmind.

https://www.biorxiv.org/content/10.1101/2022.07.21.500999v1 https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1

To be overly dismissive is to lack imagination.

>>zack-m+rT
Rosetta methods are also moving towards ML. Here’s an article from last week: https://www.science.org/doi/10.1126/science.abn2100

>>codedo+5C
>And why we cannot just simulate protein behaviour and interactions using quantum mechanics?

If you wanted to simulate the behaviour of an entire protein using quantum mechanics, the sheer number of calculations required would be infeasible.

For what it's worth, I have a background in computational physics and am studying a PhD in structural biology. For any system (of any size) that you want to simulate, you have to consider how much information you're willing to 'ignore' in order to focus on the information you would like to 'get out' of a set of simulations. Being aware of the approximations you make and how this impacts your results is crucial.

For example, if I am interested in how the electrons of a group of Carbon atoms (radius ~ 170 picometres) behave, I may want to use Density Functional Theory (DFT), a quantum mechanical method.

For a single, small protein (e.g. ubiquitin, radius ~ 2 nanometres), I may want to use atomistic molecular dynamics (AMD), which models the motion of every single atom in response to thermal motion, electrostatic interactions, etc using Newton's 2nd law. Electron/proton detail has been approximated away to focus on overall atomic motion.

In my line of work, we are interested in how big proteins (e.g. the dynein motor protein, ~ 40 nanometres in length) move around and interact with other proteins at longer time (micro- to millisecond) and length (nano- to micrometre) scales than DFT or AMD. We 'coarse-grain' protein structures by representing groups of atoms as tetrahedra in a continuous mesh (continuum mechanics). We approximate away atomic detail to focus on long-term motion of the whole protein.

Clearly, it's not feasible to calculate the movement of dynein for hundreds of nanoseconds using DFT! The motor domain alone in dynein contains roughly one million atoms (and it has several more 'subunits' attached to it). Assuming these are mostly Carbon, Oxygen or Nitrogen, then you're looking at around ten million electons in your DFT calculations, for a single step in time (rounding up). If you're dealing with the level of atomic bonds, you're probably going to a use time steps between a femto- (10^-15 s) or picosecond (10^-12 s). The numbers get a bit ridiculous. There are techniques that combine QM and AMD, although I am not too knowledgeable in this area.

Some further reading, if you're interested (I find Wikipedia articles on these topics to generally be quite good):

DFT: https://en.wikipedia.org/wiki/Density_functional_theory

Biological continuum mechanics: https://doi.org/10.1371/journal.pcbi.1005897

Length scales in biological simulations: https://doi.org/10.1107/S1399004714026777

Electronic time scales: https://www.pnas.org/doi/10.1073/pnas.0601855103

>>klemol+Pm
* https://pdb101.rcsb.org/motm/

* https://ccsb.scripps.edu/goodsell/

* https://pdb101.rcsb.org/sci-art/geis-archive/irving-geis

* https://www.digizyme.com/portfolio.html

* https://www.drewberry.com/

* https://biochem.web.utah.edu/iwasa/projects.html

* http://onemicron.com/

* The art of Jane Richardson, of which I couldn’t find a link

* This blog has plenty of good links: https://blogs.oregonstate.edu/psquared/

>>MindGo+(OP)
Many thanks to Deepmind for releasing predicted structures of all known protein monomers. What I'd like next is for Alphafold (or some other software) to be able to show us multimeric structures based on the single monomer/subunit predictions and protein-protein interactions (i.e. docking). For example the one I helped work on back in my structural biology days was the circadian clock protein KaiC : https://www.rcsb.org/structure/2GBL, that's the "complete" hexameric structure that shows how each of the subunits pack. The prediction for the single monomer that forms a hexamer is very close to the experimental https://alphafold.ebi.ac.uk/entry/Q79PF4 and in fact shows the correct structure of AA residues 500 - 519 which we were never able to validate until 12 years later (https://www.rcsb.org/structure/5C5E) when we expressed those residues along with another protein called KaiA which we knew binds to the "top" CII terminal (AAs 497-519) of KaiC. If we would have had this data then, it would have allowed us to not only make better predictions about biological function and protein-protein interactions but would have helped better guide future experiments.

What we can do with this data now is use methods such as cryo-em to see the "big picture", i.e. multi-subunit protein-protein interactions where we can plug in the Alphafold predicted structure into the cryo-em 3d density map and get predicted angstrom level views of what's happening without necessarily having to resort to slower methods such as NMR or x-ray crystallography to elucidate macromolecular interactions.

A small gripe about the alphafold ebi website: it doesn't seem to show the known experimental structure, it just shows "Experimental structures: None available in PDB". For example the link to the alphafold structure above should link to the 2GBL, 1TF7, or any of the other kaic structures from organism PCC7942 at RCSB. This would require merging/mapping data from RCSB with EBI and at least doing some string matching, hopefully they're working on it!

>>dekhn+BS
Organisms, yes. Individual genes within an organism may have no sequence identity to genes in other organisms (outside of what you would expect at random). See: https://en.wikipedia.org/wiki/Orphan_gene

>>biomcg+Vb1
Yes, that's what I thought. I worked with m. genitalium and we were always looking for proteins that had no homology or no existing structure (https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20...)

>>inspir+LN
To answer my own question it looks like for folks who don’t want to wait 21 months for 21 terabytes, that it might cost approximately 1600 USD to download the full approx 20TB dataset assuming egress costs of .08 USD per GB as mentioned here: https://cloud.google.com/storage/pricing#network-egress It’s a pity it’s so expensive to download

>>valara+y51
In most other sub fields you don’t get to not publish if exactly one endpoint never comes to pass. I know I didn’t have something like that, and most of my non crystallographer friends didn’t.

There’s a lot of structural biology apologists here in this thread. Happy to crap on DeepMind but not ready to take criticism of their own field.

For anyone outside of the field wanting to learn more, check out this documentary: https://en.m.wikipedia.org/wiki/Naturally_Obsessed

>>epups+69
> demonstrates how AI can already outclass humans for certain tasks

I'm not sure how clear the edge over humans in this case is. There were some attempts at machine assisted human solving like Foldit that did produce results: https://en.wikipedia.org/wiki/Foldit#Accomplishments

>>crispy+Ww
I've been going through MIT's online Introduction to Biology course[0] that answers some of your questions here with regards to the shapes and what they signify - specifically the "Proteins and Protein Structure" lessons in the second unit, although some of the previous lectures are helpful setup as well - really interesting and engaging stuff, taught by Eric Lander (who ended up being one of the CRISPR pioneers featured in Isaacson's latest book)

[0]https://learning.edx.org/course/course-v1:MITx+7.00x+2T2022/...

>>sabujp+501
You might be interested in https://www.biorxiv.org/content/10.1101/2021.10.04.463034v2

>>jebark+Vo
Reference: https://en.wikipedia.org/wiki/Kinesin , https://en.wikipedia.org/wiki/Dynein , https://en.wikipedia.org/wiki/Myosin

They are called motor proteins because they convert chemical energy into kinetic energy. In the case of kinesin, it forms a dimer (two copies of itself bind together to form the two "legs") and also binds to light chains (accessory proteins that modulate its behavior) so that it can walk along filaments and drag cargo around your cells. They are both proteins and more complex structures because multiple proteins are interacting, as well as binding small molecules and catalyzing them into chemical products, all to produce the motion.

>>crispy+Ww
If you want something that leaves a little less to the imagination, check out https://en.wikipedia.org/wiki/Staphylococcus_aureus_alpha_to... . It looks just like what it does: drill a giant hole in cell membranes.

Some proteins have 3D structures that look like abstract art only because we don't have an intuitive understanding of what shape and amino acids are necessary to convert chemical A to chemical B, which is the main purpose of many enzymes in the body. If you look at structural proteins or motor proteins, on the other hand, their function is clear from their shape.

There are a lot of other things you can do with the shape. If it has a pore, you can estimate the size and type of small molecule that could travel through it. You can estimate whether a binding site is accessible to the environment around it. You can determine if it forms a multimer or exists as a single unit. You can see if protein A and protein B have drastically different shapes given similar sequences, which might have implications for its druggability or understanding its function.

>>Alop3x+AW1
https://alphafold.ebi.ac.uk/entry/W6KDG8

The ribbon shape for GFP is a very cool barrel thing

>>e_i_pi+JZ1
One of my favorites are some antifreeze proteins:

* https://www.rcsb.org/structure/1m8n

* https://iiif.elifesciences.org/lax/05142%2Felife-05142-fig1-...

>>wrycod+jc1
Or to make a disclaimer.. like the OP post did?

Merriam webster[1]: " Definition of disclaim

intransitive verb 1 : to make a disclaimer ... "

[1]: https://www.merriam-webster.com/dictionary/disclaim

>>Balgai+Qo2
Bert Hubert, “Our amazing immune system”: https://berthub.eu/articles/posts/immune-system/

>>ramraj+sk1
> It’s only marginally less useful to actual biology than full on X-ray structures anyway.

I'm not sure what you're implying here. Are you saying both types of structures are useful, but not as useful as the hype suggests, or that an X-Ray Crystal (XRC) and low confidence structures are both very useful with the XRC being marginally more so?

An XRC structure is great, but it's a very (very) long way from getting me to a drug. Observe the long history of fully crystalized proteins still lacking a good drug. Or this piece on the general failure of purely structure guided efforts in drug discovery for COVID (https://www.science.org/content/blog-post/virtual-screening-...). I think this tech will certainly be helpful, but for most problems I don't see it being better than a slightly-more-than-marginal gain in our ability to find medicines.

Edit: To clarify, if the current state of the field is "given a well understood structure, I often still can't find a good medicine without doing a ton of screening experiments" then it's hard to see how much this helps us. I can also see several ways in which a less than accurate structure could be very misleading.

FWIW I can see a few ways in which it could be very useful for hypothesis generation too, but we're still talking pretty early stage basic science work with lots of caveats.

Source: PhD Biochemist and CEO of a biotech.

>>crispy+Ww
Watch this video on DNA polymerase [1]. Obviously it’s an illustration, but I think it helps answer you question because cartoons are great. (MD, not PhD biologist)

[1] https://youtu.be/sKe3UgH1AKg

>>MindGo+(OP)
Come play biotech with us and let's figure out EVERYTHING and not just protein folding, yay! https://epicquest.bio

zlacker

AlphaFold reveals the structure of the protein universe