DeepMind has made significant progress toward solving the protein folding problem, although protein folding is not yet fully solved.
Here I will describe:
* the complexity of the protein folding problem: roughly 10^300 folding possibilities for an average protein, based upon certain assumptions
* how protein folding compares to other large problems, with some discussion of comprehending large numbers
Complexity of Protein Folding
The scale of the protein folding problem was fairly well defined and understood in 1969, when Cyrus Levinthal explained it:
Proteins are macromolecules which possess several unique properties. They are very large (containing 2,000 or more atoms) and complex. Their structures show no obvious regularity but a very subtle regularity is apparent upon close examination. We know from the fact that proteins may be crystallized and further from x-ray crystallography that each atom occupies a unique place in the relative 3-dimensional space of the molecule. If we consider a protein containing 2,000 atoms with no structural restrictions, such a macromolecule would possess 6,000 degrees of freedom. We know, however, from x-ray studies and other techniques as well, that there are indeed certain structural restrictions in a polypeptide structure. For example, if we schematically indicate a polypeptide chain as in Figure 1, we find that the 6 atoms in each unit indicated by the dotted lines lie in a common plane. Considerations of such factors allow us to predict only 450 degrees of freedom in a protein structure containing 150 amino acids, for example. Of these 450 degrees of freedom, 300 would be due to rotations and 150 would be due to relative bond angles of the side chains.
There was earlier work in the 1960s attempting to predict the 3-dimensional structure of some polypeptides from primary sequence information.
If we begin with a set of bond angles and bond lengths and go to 3-dimensional coordinates (via vector matrix multiplications), we can build a 3-dimensional image and display it on a computer controlled oscilloscope. If we know the coordinates of any two atoms and their interaction energy functions, could we extend this treatment to sum the total energy of a given polypeptide or protein structure?
How accurately must we know the bond angles to be able to estimate these energies? Even if we knew these angles to better than a tenth of a radian, there would be 10^300 possible configurations in our theoretical protein. In nature, proteins apparently do not sample all of these possible configurations since they fold in a few seconds, and even postulating a minimum time for going from one conformation to another, the proteins would have time to try on the order of 10^8 different conformations at most before reaching their final state. We feel that protein folding is speeded and guided by the rapid formation of local interactions which then determine the further folding of the peptide. This suggests local amino acid sequences which form stable interactions and serve as nucleation points in the folding process.
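The 10^300 figure can be reproduced with simple arithmetic. A sketch in Python, assuming (as in Levinthal's 150-residue example) about 300 rotational degrees of freedom and roughly 10 distinguishable settings per rotation; the 10-settings granularity is an assumption chosen to match the figure, not Levinthal's exact derivation:

```python
# Back-of-the-envelope reproduction of Levinthal's 10^300 estimate.
# Assumption: ~10 distinguishable settings per rotational degree of freedom;
# Levinthal's 150-residue example has ~300 rotational degrees of freedom.
rotational_dof = 300
settings_per_rotation = 10
configurations = settings_per_rotation ** rotational_dof

# A protein sampling ~10^8 conformations before folding covers a vanishing
# fraction of this space.
sampled = 10 ** 8
print(f"configurations: 10^{len(str(configurations)) - 1}")        # 10^300
print(f"fraction sampled: 10^-{len(str(configurations // sampled)) - 1}")
```

Even at 10^8 conformations tried, the protein has explored only one part in 10^292 of the space, which is why random search cannot be the folding mechanism.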
Other Large Problems
The state-space complexity of a game is the number of legal game positions reachable from the initial position of the game.
The game tree size is the total number of possible games that can be played: the number of leaf nodes in the game tree rooted at the game’s initial position.
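Both measures can be computed exactly for a tiny game. A sketch in Python that enumerates tic-tac-toe as an illustrative stand-in (using the standard convention that play stops at a win or a full board):

```python
# Exhaustively enumerate tic-tac-toe: distinct reachable positions give the
# state-space complexity; terminal play-throughs give the game tree size.
WIN_LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for i, j, k in WIN_LINES:
        if board[i] != ' ' and board[i] == board[j] == board[k]:
            return board[i]
    return None

states = set()
leaf_games = 0

def explore(board, player):
    global leaf_games
    states.add(board)
    if winner(board) or ' ' not in board:  # play stops at a win or full board
        leaf_games += 1
        return
    for i in range(9):
        if board[i] == ' ':
            explore(board[:i] + player + board[i+1:],
                    'O' if player == 'X' else 'X')

explore(' ' * 9, 'X')
print(len(states))   # 5,478 reachable positions (state-space complexity)
print(leaf_games)    # 255,168 possible games (game tree size)
```

Even for this trivial game the game tree size (255,168) dwarfs the state-space complexity (5,478); for chess or Go the gap between the two measures is astronomically larger.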
Enigma could not encode the same letter back to itself.
Operating shortcomings greatly helped in breaking Enigma:
* The production of an early Enigma training manual containing an example of plaintext and its genuine ciphertext, together with the relevant message key. When Rejewski was given this in December 1932, it “made [his reconstruction of the Enigma machine] somewhat easier”.
* Repetition of the message key, as described in Rejewski’s characteristics method. (This helped Rejewski solve Enigma’s wiring in 1932, and the practice continued until May 1940.)
* Repeatedly using the same stereotypical expressions in messages, an early example of what Bletchley Park would later term cribs. Rejewski wrote that “… we relied on the fact that the greater number of messages began with the letters ANX—German for “to”, followed by X as a spacer”.
* The use of easily guessed keys such as AAA or BBB, or sequences that reflected the layout of the Enigma keyboard, such as “three [typing] keys that stand next to each other [o]r diagonally [from each other]…” At Bletchley Park such occurrences were called cillies. Cillies in the operation of the four-rotor Abwehr Enigma included four-letter names and German obscenities. Sometimes, with multi-part messages, the operator would not enter a key for a subsequent part of a message, merely leaving the rotors as they were at the end of the previous part, to become the message key for the next part.
* Having only three different rotors for the three positions in the scrambler. (This continued until December 1938, when it was increased to five and then eight for naval traffic in 1940.)
* Using only six plugboard leads, leaving 14 letters unsteckered. (This continued until January 1939 when the number of leads was increased, leaving only a small number of letters unsteckered.)
Other useful shortcomings that were discovered by the British and later the American cryptanalysts included the following, many of which depended on frequent solving of a particular network:
* The practice of re-transmitting a message in an identical, or near-identical, form on different cipher networks. If a message was transmitted using both a low-level cipher that Bletchley Park broke by hand, and Enigma, the decrypt provided an excellent crib for Enigma decipherment.
* For machines where there was a choice of more rotors than there were slots for them, a rule on some networks stipulated that no rotor should be in the same slot in the scrambler as it had been for the immediately preceding configuration. This reduced the number of wheel orders that had to be tried.
* Not allowing a wheel order to be repeated on a monthly setting sheet. This meant that when the keys were being found on a regular basis, economies in excluding possible wheel orders could be made.
* The stipulation, for Air Force operators, that no letter should be connected on the plugboard to its neighbour in the alphabet. This reduced the problem of identifying the plugboard connections and was automated in some Bombes with a Consecutive Stecker Knock-Out (CSKO) device.
* The sloppy practice that John Herivel anticipated soon after his arrival at Bletchley Park in January 1940. He thought about the practical actions that an Enigma operator would have to make, and the short cuts he might employ. He thought that, after setting the alphabet rings to the prescribed setting, and closing the lid, the operator might not turn the rotors by more than a few places in selecting the first part of the indicator. Initially this did not seem to be the case, but after the changes of May 1940, what became known as the Herivel tip proved to be most useful.
* The practice of re-using some of the columns of wheel orders, ring settings or plugboard connections from previous months. The resulting analytical short-cut was christened Parkerismus at Bletchley Park, after Reg Parker, who had, through his meticulous record-keeping, spotted this phenomenon.
* The re-use of a permutation in the German Air Force METEO code as the Enigma stecker permutation for the day.
Mavis Lever, a member of Dilly Knox’s team, recalled an occasion when there was an extraordinary message.
The one snag with Enigma of course is the fact that if you press A, you can get every other letter but A. I picked up this message and—one was so used to looking at things and making instant decisions—I thought: ‘Something’s gone. What has this chap done? There is not a single L in this message.’ My chap had been told to send out a dummy message and he had just had a fag [cigarette] and pressed the last key on the keyboard, the L. So that was the only letter that didn’t come out. We had got the biggest crib we ever had, the encypherment was LLLL, right through the message and that gave us the new wiring for the wheel [rotor]. That’s the sort of thing we were trained to do. Instinctively look for something that had gone wrong or someone who had done something silly and torn up the rule book
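The no-self-encipherment property that made the all-L message so transparent also gives a mechanical filter for placing cribs: a guessed plaintext cannot sit at any offset where it matches the ciphertext letter-for-letter. A sketch in Python (the ciphertext string is invented for illustration, not a real intercept):

```python
# Crib placement filter: Enigma never enciphers a letter to itself, so any
# offset where the crib and ciphertext share a letter in the same position
# (a "crash") can be ruled out immediately.
def possible_offsets(ciphertext, crib):
    offsets = []
    for off in range(len(ciphertext) - len(crib) + 1):
        window = ciphertext[off:off + len(crib)]
        if all(c != p for c, p in zip(window, crib)):
            offsets.append(off)
    return offsets

ciphertext = "QWERTZUABCDANXGHI"   # invented example, not real traffic
crib = "ANX"                       # the common German opening noted above
print(possible_offsets(ciphertext, crib))  # offsets 7 and 11 are ruled out
```

Every excluded offset narrows where the crib can be tested on the Bombe, so a long crib like the LLLL message eliminated almost everything at once.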
DeepMind AlphaFold Approach Toward Solving Protein Folding
In 2019, the DeepMind AlphaFold approach was summarized. AlphaFold 2 was an updated version that achieved even better results.
Deep learning was just one aspect of the structure prediction process, and the final structures were actually a result of gradient descent optimization. The DeepMind team tried a ‘fancier’ strategy involving fragment assembly using Generative Adversarial Networks (GANs), but in the end, the best results were obtained by gradient descent optimization.
Method 1: Fragment Assembly
The overall structure of a protein is a combination of smaller fragmented units of structure; these sub-units are somewhat modular and form motifs that are re-used with slight modifications across different proteins and protein families. This is a major reason for the use of multiple sequence alignment as part of most protein structure prediction models. By comparing a novel protein sequence to a database of sequences that have known structures, an estimate of the sub-structure in the new protein can be inferred by taking the structures formed in proteins with similar sequences as templates.
Notably, while DeepMind did use multiple sequence alignment in their approach, they did not use any templates. That is, while they did compare the sequences to databases of known protein fragment sequences, they didn’t borrow structures associated with those sequences directly. They used a generative neural network to come up with fragment candidates for insertion into an otherwise more conventional structure optimization workflow using simulated annealing. Although GANs have matured substantially in the past few years with impressive results on tasks like de novo image generation, in this case, the results were not state of the art. For that, DeepMind would combine old and new with a pipeline incorporating deep learning for scoring and gradient descent for objective optimization.
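The fragment-insertion-plus-simulated-annealing loop can be sketched in miniature. In this toy version (my own illustration, not DeepMind's actual pipeline), the "fragments" are random 3-residue torsion proposals and the score is a made-up smoothness objective:

```python
import math
import random

random.seed(0)

def score(torsions):
    # Hypothetical objective: neighbouring torsion angles should agree.
    return sum((a - b) ** 2 for a, b in zip(torsions, torsions[1:]))

n = 20
torsions = [random.uniform(-math.pi, math.pi) for _ in range(n)]
current = score(torsions)
start = current
temp = 1.0
for step in range(5000):
    # Propose a 3-mer fragment insertion at a random position.
    i = random.randrange(n - 2)
    fragment = [random.uniform(-math.pi, math.pi) for _ in range(3)]
    trial = torsions[:i] + fragment + torsions[i + 3:]
    trial_score = score(trial)
    # Metropolis criterion: accept improvements; occasionally accept worse.
    if trial_score < current or random.random() < math.exp((current - trial_score) / temp):
        torsions, current = trial, trial_score
    temp *= 0.999  # geometric cooling schedule

print(f"score: {start:.2f} -> {current:.2f}")
```

The cooling schedule lets early steps escape poor local arrangements while late steps only refine, which is the same trade-off a fragment-assembly protocol like Rosetta's manages at much larger scale.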
Method 2: Gradient Descent on Deep Learning Scores
The core deep learning aspect of DeepMind’s winning entry was a neural network that predicted likely distances between amino acid pairs, as well as the angles of each peptide bond linking amino acid residues. These two predictions were then incorporated into a score along with ‘score2’ from Rosetta modeling software, followed by gradient descent to minimize the combined objective cost directly. This is a bit of an unfair simplification of all the engineering work that went into making the process work so well, but conceptually the winning prediction strategy was surprisingly simple. It wasn’t an end-to-end deep learning project, and the fancier method involving GANs didn’t perform as well. Taken together the DeepMind entry suggests that clever strategy can still beat brute force computational resources applied with big deep learning models, and emphasizes the need to maintain some flexibility in your machine learning hardware and software stack. This is somewhat in contrast to the approach espoused by OpenAI.
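The distance-prediction-plus-gradient-descent idea can also be sketched in miniature. In this toy version (my own illustration, not the AlphaFold code), the "predicted" pairwise distances come from a known set of 2-D points rather than a neural network, so convergence can be checked:

```python
import math
import random

random.seed(1)
true_pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 1.5)]
n = len(true_pts)

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Stand-in for the network's output: a target distance for every pair.
predicted = {(i, j): dist(true_pts[i], true_pts[j])
             for i in range(n) for j in range(i + 1, n)}

def total_error(pts):
    return sum((dist(pts[i], pts[j]) - d) ** 2
               for (i, j), d in predicted.items())

# Gradient descent from random coordinates on the squared distance error.
pts = [[random.uniform(-2, 2), random.uniform(-2, 2)] for _ in range(n)]
start_error = total_error(pts)
lr = 0.05
for step in range(2000):
    grads = [[0.0, 0.0] for _ in range(n)]
    for (i, j), target in predicted.items():
        d = dist(pts[i], pts[j]) or 1e-9  # avoid division by zero
        coef = 2 * (d - target) / d       # d/dx of (d - target)^2
        for k in range(2):
            g = coef * (pts[i][k] - pts[j][k])
            grads[i][k] += g
            grads[j][k] -= g
    for i in range(n):
        for k in range(2):
            pts[i][k] -= lr * grads[i][k]

print(f"distance error: {start_error:.3f} -> {total_error(pts):.6f}")
```

Because the objective is differentiable in the coordinates, plain gradient descent recovers a consistent geometry; AlphaFold's insight was that neural-network distance predictions are good enough to make the same optimization work on real proteins.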
SOURCES: Exxact, Wikipedia, Universe Today
Written By Brian Wang, Nextbigfuture.com