Figure S1: The two color-coded 60 by 61 substitution matrices represent (A) the DPANN matrix, which was derived from the weights of an artificial neural network, and (B) a matrix derived from the same set of structural pairs through a log-odds approach. To make the two matrices comparable, their values were normalized by the average and standard deviation. Colors in the blue spectrum indicate favorable exchanges, while colors in the red spectrum indicate unfavorable exchanges. The log-odds matrix is considerably more structured but has comparatively weaker overall preferences. Its strongest preferences are for maintaining a given amino acid in a given secondary structure state (represented by the strong blue tint on the diagonal). Because of the sparseness of the data, the log-odds matrix contains a large number of unobserved exchanges (368 out of 1890 values; in those cases the value was set to a value just below that of the least frequent observed exchange). This is likely one of the reasons why it appears far more structured: unobserved exchanges automatically lead to negative substitution values.

In contrast, the DPANN matrix treats many substitutions as neutral, while a few substitutions are strongly encouraged or discouraged. With the exception of Methionine, the substitution of a hydrophobic amino acid by a gap is heavily penalized, while substituting hydrophilic residues with gaps is frequently encouraged. This, too, is expected: hydrophobic residues tend to occur more frequently in the core of the protein, where insertions and deletions are more difficult to accommodate, while hydrophilic residues tend to occur on the surface of the protein, where loop regions with many insertions and deletions are more common. Also noteworthy is that many substitutions on the diagonal are identified as neutral, unlike in the log-odds matrix. These neutral self-substitutions occur for residues in the coil state, and the trend does not extend to hydrophobic amino acids, which always feature highly positive values for self-substitutions. Off the diagonal, the strongest positive substitutions are again between hydrophobic residues substituting for each other (Isoleucine, Leucine, Valine and, to a lesser extent, Phenylalanine), while the most disfavored substitutions are between hydrophobic residues in either the helical or the strand state and hydrophilic residues or Glycine in the coil state. The appearance of the DPANN matrix agrees with intuition, in that protein structure is known to withstand a large number of substitutions at the sequence level. It appears that the neural-network-trained matrix manages to capture some true relationships despite the sparseness of the data.
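
The normalization and the handling of unobserved exchanges mentioned in the caption can be illustrated with a minimal sketch. It is not the procedure used to produce the figure: the function names `zscore` and `log_odds`, the `floor_factor` parameter, and the toy count matrix are hypothetical, and the sketch assumes that "just below the least frequent one" refers to a pair frequency slightly smaller than the smallest observed one.

```python
import numpy as np

def zscore(matrix):
    """Normalize all entries by the overall mean and standard deviation."""
    m = np.asarray(matrix, dtype=float)
    return (m - m.mean()) / m.std()

def log_odds(counts, floor_factor=0.5):
    """Log-odds scores from a symmetric matrix of observed pair counts.

    Assumption: unobserved pairs (count 0) receive a pseudo-frequency of
    floor_factor times the smallest observed pair frequency. Every state
    is assumed to be observed at least once, so no marginal is zero.
    """
    counts = np.asarray(counts, dtype=float)
    observed = counts > 0
    q = counts / counts.sum()          # observed pair frequencies
    p = q.sum(axis=1)                  # background frequency of each state
    expected = np.outer(p, p)          # pair frequency expected by chance
    floor = q[observed].min() * floor_factor
    q_filled = np.where(observed, q, floor)
    return np.log2(q_filled / expected)

# Toy 3-state example with one unobserved exchange (states 0 and 2).
toy_counts = [[8, 2, 0],
              [2, 6, 2],
              [0, 2, 8]]
scores = log_odds(toy_counts)
normalized = zscore(scores)
```

In this toy example the unobserved exchange receives the most negative score in the matrix, which illustrates why a sparse log-odds matrix with many unobserved pairs can look strongly structured.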