Digital information preservation in DNA
• Robust Chemical Preservation of Digital Information on DNA in Silica with Error-Correcting Codes (Angewandte 2015)
• Towards practical, high-capacity, low-maintenance information storage in synthesized DNA (Nature 2013)
• Next-Generation Digital Information Storage in DNA (Science 2012)
Mikk Puustusmaa 2015

Introduction
• Information such as text printed on paper or images projected onto microfilm can survive for over 500 years. However, storing digital information for time frames exceeding 50 years is challenging.
• As digital information continues to accumulate, higher-density and longer-term storage solutions are necessary.
• DNA has many potential advantages as a medium for immutable, high-latency information storage needs.
  – At the theoretical maximum, DNA can encode two bits per nucleotide (nt), or 455 exabytes (455 × 10^12 MB) per gram of single-stranded DNA.
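
The density figure above can be roughly reproduced from first principles. The sketch below assumes an average mass of about 330 g/mol per single-stranded nucleotide; that value is an assumption and is not stated on the slide.

```python
AVOGADRO = 6.022e23          # nucleotides per mole
NT_MASS_G_PER_MOL = 330.0    # ASSUMED average mass of one ssDNA nucleotide
BITS_PER_NT = 2              # theoretical maximum: 4 bases -> 2 bits


def max_bytes_per_gram(nt_mass=NT_MASS_G_PER_MOL):
    """Upper bound on bytes stored in one gram of single-stranded DNA."""
    nucleotides = AVOGADRO / nt_mass
    return nucleotides * BITS_PER_NT / 8


exabytes = max_bytes_per_gram() / 1e18
print(f"{exabytes:.0f} EB per gram")
```

Under this assumption the result lands within about one percent of the 455 exabytes quoted on the slide.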

Advantages
• Unlike most digital storage media, DNA storage is not restricted to a planar layer and is often readable despite degradation in non-ideal conditions over millennia.
• Most recently, 300,000-year-old mitochondrial DNA from bears and humans has been sequenced.
• DNA’s essential biological role provides access to natural reading and writing enzymes and ensures that DNA will remain a readable standard for the foreseeable future.

Advantages
• The number of bases of synthesized DNA needed to encode information grows linearly with the amount of information to be stored, but we must also consider the indexing information required to reconstruct full-length files from short fragments.
• As indexing information grows only as the logarithm of the number of fragments to be indexed, the total amount of synthesized DNA required grows sub-linearly.
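
To make the logarithmic growth of the index concrete, here is a minimal sketch that computes the smallest address width for a given number of fragments. The function name is hypothetical; the fragment counts 54,898 and 153,335 are the ones reported on the Science 2012 and Nature 2013 slides.

```python
def index_bits(n_fragments: int) -> int:
    """Smallest number of bits that gives every fragment a distinct address."""
    if n_fragments < 1:
        raise ValueError("need at least one fragment")
    return max(1, (n_fragments - 1).bit_length())


for n in (1_000, 54_898, 153_335, 10**9):
    print(n, index_bits(n))
```

A real design may reserve more bits than this minimum; the Science 2012 experiment, for example, used 19-bit addresses.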

Problems
• Since synthesis and sequencing of very long DNA strands are technically limited, data must be stored on many short DNA segments.
• Approaches using living vectors are not as reliable, scalable or cost-efficient. Their disadvantages include constraints on the genomic elements and locations that can be manipulated without affecting viability, the fact that mutation reduces the fidelity of stored and decoded information over time, and possibly the requirement for carefully regulated storage conditions.

Science 2012
• They converted an HTML-coded draft of a book that included 53,426 words, 11 JPG images, and one JavaScript program into a 5.27-megabit bitstream (about 659 kB).
• Individual bits were converted to A or C for 0 and to T or G for 1. Bases were chosen randomly while disallowing homopolymer runs greater than three. Addresses of the bitstream were 19 bits long and numbered consecutively, starting from 0000000001.

Science 2012
• They encode one bit per base (A or C for zero, G or T for one) instead of two. This allows them to encode a message in many ways in order to avoid sequences that are difficult to read or write, such as extreme GC content, repeats, or secondary structure.
• By splitting the bit stream into addressed data blocks, they eliminate the need for long DNA constructs that are difficult to assemble at this scale.
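
The one-bit-per-base rule can be sketched as follows. This is only an illustration of the idea described above (random choice between two bases, no homopolymer run longer than three); it does not handle GC content, repeats, or secondary structure, and the function names are hypothetical.

```python
import random

ZERO_BASES = "AC"   # either base encodes bit 0
ONE_BASES = "GT"    # either base encodes bit 1
MAX_RUN = 3         # longest homopolymer run allowed


def encode_bits(bits, rng=None):
    """Map each bit to a random allowed base, avoiding runs longer than MAX_RUN."""
    if rng is None:
        rng = random.Random()
    seq = []
    for bit in bits:
        options = list(ONE_BASES if bit else ZERO_BASES)
        rng.shuffle(options)
        base = options[0]
        # If this base would extend a run beyond MAX_RUN, use the alternative.
        if len(seq) >= MAX_RUN and all(b == base for b in seq[-MAX_RUN:]):
            base = options[1]
        seq.append(base)
    return "".join(seq)


def decode_bases(seq):
    """A or C decodes to 0, G or T decodes to 1."""
    return [0 if base in ZERO_BASES else 1 for base in seq]


def longest_run(seq):
    """Length of the longest homopolymer run in seq."""
    best = run = 0
    prev = None
    for base in seq:
        run = run + 1 if base == prev else 1
        best = max(best, run)
        prev = base
    return best
```

Because two bases are available for every bit, an alternative always exists when the first choice would create a run of four.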

Synthesis
• We synthesized 54,898 oligonucleotides on Agilent’s Oligo Library Synthesis microarray platform.
• To avoid cloning and sequence-verifying constructs, they synthesized, stored, and sequenced many copies of each individual oligo.
• The library consisted of 54,898 oligonucleotides of 159 nt, each comprising:
  – a 96-bit data block (96 nt),
  – a 19-bit address specifying the location of the data block in the bit stream (19 nt),
  – flanking 22-nt common sequences for amplification and sequencing.
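
As a quick consistency check on the oligo layout above, the following sketch adds up the parts. It assumes one 22-nt common sequence on each side of the oligo, which is what the 159-nt total implies.

```python
DATA_NT = 96        # one bit per base -> 96-bit data block
ADDRESS_NT = 19     # 19-bit address
PRIMER_NT = 22      # common flanking sequence, ASSUMED one on each side
N_OLIGOS = 54_898

oligo_length = DATA_NT + ADDRESS_NT + 2 * PRIMER_NT
payload_bits = N_OLIGOS * DATA_NT
payload_fraction = DATA_NT / oligo_length

print(oligo_length, payload_bits, round(payload_fraction, 3))
```

The parts sum to 159 nt, and the total payload of 5,270,208 bits agrees with the 5.27-megabit bitstream. Only about 60% of each oligo carries data.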

Sequencing
• We sequenced the amplified library by loading it on a single lane of a HiSeq 2000 using paired-end 100-nt reads. The lane yielded 346,151,426 paired reads, with 87.14% ≥ Q30 and a mean Q score of 34.16. Since we were sequencing a 115-bp construct with paired 100-bp reads, we used SeqPrep (9) to combine overlapping reads into a single contig.
• They joined overlapping paired-end 100-nt reads to reduce the effect of sequencing error.
• Errors in synthesis and sequencing are rarely coincident, so each molecular copy corrects errors in the other copies.

Results
• Using only reads with the expected 115-nt length and perfect barcode sequences, we generated a consensus at each base of each data block at an average of ~3,000-fold coverage.
• All data blocks were recovered with a total of 10 bit errors out of 5.27 million. These were predominantly located within homopolymer runs at the end of the oligo, where we only had single sequence coverage.
• Future work could use compression, redundant encodings, parity checks, and error correction to improve density, error rate, and safety.
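
Consensus calling across many copies can be sketched as a per-position majority vote. This is a minimal illustration under the assumption that reads are already filtered to equal length and grouped by address; it is not the pipeline used in the paper, and the example reads are made up.

```python
from collections import Counter


def consensus(reads):
    """Majority base at each position across equal-length reads."""
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("all reads must have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )


# Hypothetical reads of one data block; two of them carry a single error.
reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAA"]
print(consensus(reads))
```

Since errors rarely hit the same position in different copies, the majority vote removes them when coverage is high. Positions with only single coverage get no such protection.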

Nature 2013
They encoded computer files totalling 739 kilobytes of hard-disk storage, with an estimated Shannon information of 5.2 × 10^6 bits, into a DNA code, synthesized this DNA, sequenced it and reconstructed the original files with 100% accuracy.

Nature 2013
• The five files comprised all 154 of Shakespeare’s sonnets (ASCII text), a classic scientific paper (PDF format), a medium-resolution colour photograph of the European Bioinformatics Institute (JPEG 2000 format), a 26-s excerpt from Martin Luther King’s 1963 ‘I have a dream’ speech (MP3 format) and a Huffman code used in this study to convert bytes to base-3 digits (ASCII text), giving a total of 757,051 bytes or a Shannon information of 5.2 × 10^6 bits.
• The bytes comprising each file were represented as single DNA sequences with no homopolymers, which are associated with higher error rates in existing high-throughput sequencing technologies and led to errors in a recent DNA-storage experiment.

Trit to nucleotide
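
A homopolymer-free mapping from base-3 digits (trits) to nucleotides can be sketched with a rotating table: each trit selects one of the three bases that differ from the previous base. The table and the starting base below are assumptions chosen for illustration; the exact assignment shown on the original slide is not recoverable here.

```python
# For each previous base, the base written for trit 0, 1 and 2.
# ASSUMED table: any assignment that never repeats the previous base works.
NEXT_BASE = {
    "A": "CGT",
    "C": "GTA",
    "G": "TAC",
    "T": "ACG",
}


def trits_to_dna(trits, start="A"):
    """Encode trits so that no base ever equals its predecessor."""
    prev = start
    out = []
    for trit in trits:
        prev = NEXT_BASE[prev][trit]
        out.append(prev)
    return "".join(out)


def dna_to_trits(seq, start="A"):
    """Invert trits_to_dna using the same table and starting base."""
    prev = start
    trits = []
    for base in seq:
        trits.append(NEXT_BASE[prev].index(base))
        prev = base
    return trits
```

Because the previous base is never among the three choices, repeated trits such as 0, 0, 0 still produce a sequence without homopolymers.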

Nature 2013
• Each DNA sequence was split into overlapping segments, generating fourfold redundancy, and alternate segments were converted to their reverse complement. These measures reduce the probability of systematic failure for any particular string, which could lead to uncorrectable errors and data loss.
• Each segment was then augmented with indexing information that permitted determination of the file from which it originated and its location within that file, and with simple parity-check error detection.
• In all, the five files were represented by a total of 153,335 strings of DNA, each comprising 117 nucleotides (nt).
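
The overlapping-segment scheme can be sketched as below. The 25-nt offset follows from the 25-base gaps mentioned in the results; the 100-nt segment length is an assumption that yields fourfold coverage, and the function names are hypothetical.

```python
SEGMENT_NT = 100   # ASSUMED segment length
STEP_NT = 25       # offset between consecutive segments

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def split_into_segments(seq, segment=SEGMENT_NT, step=STEP_NT):
    """Overlapping segments; every second one is reverse complemented."""
    segments = []
    for i, start in enumerate(range(0, len(seq) - segment + 1, step)):
        piece = seq[start:start + segment]
        if i % 2 == 1:
            piece = reverse_complement(piece)
        segments.append(piece)
    return segments


def coverage(length, segment=SEGMENT_NT, step=STEP_NT):
    """How many segments cover each position of a sequence of this length."""
    counts = [0] * length
    for start in range(0, length - segment + 1, step):
        for pos in range(start, start + segment):
            counts[pos] += 1
    return counts
```

In this sketch every interior position is covered by four segments, while the first and last positions of the sequence have lower coverage.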

Synthesis
• We synthesized oligonucleotides (oligos) corresponding to our designed DNA strings using an updated version of Agilent Technologies’ OLS.
• Errors occur only rarely (1 error per 500 bases) and independently in the different copies of each string.
• The DNA is in lyophilized form, which is expected to have excellent long-term preservation characteristics.

Sequencing
• Sequencing was done in paired-end mode on the Illumina HiSeq 2000.
• Strings with uncertainties due to synthesis or sequencing errors were discarded and the remainder decoded using the reverse of the encoding procedure, with the error-detection bases and properties of the coding scheme allowing further strings containing errors to be discarded. Although many discarded strings will have contained information that could have been recovered with more sophisticated decoding, the high level of redundancy and sequencing coverage rendered this unnecessary in their experiment.

Results
• Four of the five resulting DNA sequences could be fully decoded without intervention. The fifth, however, contained two gaps, each a run of 25 bases, for which no segment was detected corresponding to the original DNA. Each of these gaps was caused by the failure to sequence any oligo representing any of four consecutive overlapping segments.
• Inspection of the neighbouring regions of the reconstructed sequence permitted us to hypothesize what the missing nucleotides should have been, and we manually inserted those 50 bases accordingly. This sequence could then also be decoded. Inspection confirmed that our original computer files had been reconstructed with 100% accuracy.

Results
• This also suggests that our mean sequencing coverage of 1,308 times was considerably in excess of that needed for reliable decoding. The data indicate that reducing the coverage by a factor of 10 (or even more) would have led to unaltered decoding characteristics, which further illustrates the robustness of our DNA-storage method.

Angewandte (2015)
• They translated 83 kB of information to 4991 DNA segments, each 158 nucleotides long, which were encapsulated in silica.
• They employed error-correcting codes to correct storage-related errors.
• Accelerated aging experiments were performed to measure DNA decay kinetics, which show that data can be archived on DNA for millennia under a wide range of conditions.

Error correcting
• In classical data-storage devices, error-correcting codes are implemented, which add redundancy and allow the correction of essentially all errors that occur during usage.
• To account for the specific requirements of storage on DNA, the existing data coding schemes had to be adapted: individual sequences are indexed, and two independent error-correcting codes (specifically Reed–Solomon codes) are used in a concatenated fashion.
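
The role of an outer code that recovers entirely lost sequences can be illustrated with a much weaker stand-in. The sketch below is not a Reed–Solomon code: it uses a single XOR parity block, which can rebuild exactly one lost block whose position is known from its index. All names and the sample data are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def add_parity(blocks):
    """Append one parity block (toy stand-in for an outer code)."""
    return list(blocks) + [xor_blocks(blocks)]


def recover_missing(stored):
    """Rebuild the single block marked None from the surviving blocks."""
    missing = [i for i, block in enumerate(stored) if block is None]
    if len(missing) != 1:
        raise ValueError("this toy code repairs exactly one lost block")
    survivors = [block for block in stored if block is not None]
    repaired = list(stored)
    repaired[missing[0]] = xor_blocks(survivors)
    return repaired
```

A Reed–Solomon outer code generalizes this idea to many lost or corrupted sequences, and the inner code additionally corrects errors within each individual sequence.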

Angewandte (2015)
• To physically test the code, we stored the text from two old documents: the Swiss Federal Charter from 1291 and the English translation of the Method of Archimedes.
• The (uncompressed) text totals 83 kilobytes. Encoding it resulted in 4991 sequences, each 117 nucleotides long, to which constant primers were added (giving a total length of 158 nt).
• The sequences were synthesized using an electrochemical microarray technology (CustomArray), prepared for sequencing by a custom PCR (polymerase chain reaction) method, and read using the Illumina MiSeq platform.

Results
• When reading the sequences, the inner code had to correct an average of 0.7 nt errors per sequence, and the outer code had to account for a loss of 0.3% of total sequences and correct about 0.4% of the sequences, resulting in a complete and error-free recovery of the original information.

Results
• To test whether DNA stored in the solid state is more stable, they took the 4991-element oligo pool and tested the stability of three previously established dry storage procedures for DNA by accelerated aging tests.

Results
• From the data shown in Figure 2 it is evident that DNA preservation is best in the inorganic storage format (DNA encapsulated in silica), which has the lowest local water concentration. Because an inorganic layer separates the DNA molecules from the environment, the degree of preservation is not affected by the humidity of the storage environment. This independence of humidity is very important for guaranteeing long-term stability, as a non-humid environment is hard to maintain.
• In contrast, stability-increasing factors such as low temperature (e.g. permafrost) and absence of light can be maintained for extended periods of time without energy input.

Results
• The original information could be recovered error-free, even after treating the DNA in silica at 70°C for one week. This is thermally equivalent to storing information on DNA in central Europe for 2000 years.
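
The thermal equivalence can be made plausible with the Arrhenius relation, which scales reaction rates with temperature. The sketch below is only an order-of-magnitude illustration: the activation energy and the mean storage temperature are assumed values and are not given on the slide.

```python
import math

R = 8.314            # gas constant, J/(mol*K)
EA = 155_000.0       # ASSUMED activation energy of DNA decay, J/mol
T_STORAGE_C = 10.0   # ASSUMED mean storage temperature, deg C
T_AGING_C = 70.0     # accelerated-aging temperature from the slide


def acceleration_factor(t_hot_c, t_cold_c, ea=EA):
    """How much faster decay runs at t_hot_c than at t_cold_c (Arrhenius)."""
    t_hot = t_hot_c + 273.15
    t_cold = t_cold_c + 273.15
    return math.exp(ea / R * (1.0 / t_cold - 1.0 / t_hot))


equivalent_weeks = acceleration_factor(T_AGING_C, T_STORAGE_C)
equivalent_years = equivalent_weeks * 7 / 365.25
print(f"one week at {T_AGING_C:.0f} C ~ {equivalent_years:.0f} years at {T_STORAGE_C:.0f} C")
```

With these assumed parameters the result falls in the range of a couple of thousand years, the same order of magnitude as the slide's figure. The estimate is very sensitive to both assumed values.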

Price
• Assuming negligible computational costs and optimized use of the technologies, they estimated current costs to be $12,400/MB for information storage in DNA and $220/MB for information decoding.
• With current technology and their encoding scheme (Nature 2013), DNA-based storage may be cost-effective for archives of several megabytes with a 600–5,000-yr horizon. One order of magnitude reduction in synthesis costs reduces this to about 50–500 yr; with two orders of magnitude reduction, as can be expected in less than a decade if current trends continue, DNA-based storage becomes practical for sub-50-yr archives.
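
The per-megabyte estimates can be turned into a simple cost calculator. This is a minimal sketch assuming costs scale linearly with archive size; the function and its parameters are hypothetical, and only the two per-MB prices come from the slide.

```python
WRITE_USD_PER_MB = 12_400   # from the slide: information storage (synthesis)
READ_USD_PER_MB = 220       # from the slide: information decoding (sequencing)


def archive_cost(size_mb, reads=1, synthesis_reduction=1.0):
    """Cost in USD of writing an archive once and reading it `reads` times."""
    write = size_mb * WRITE_USD_PER_MB / synthesis_reduction
    read = size_mb * READ_USD_PER_MB * reads
    return write + read


print(archive_cost(1.0))                           # current estimate
print(archive_cost(1.0, synthesis_reduction=100))  # synthesis 100x cheaper
```

Writing dominates the cost today, which is why reductions in synthesis cost shorten the horizon at which DNA storage pays off.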

Price
• DNA-based storage might already be economically viable for long-horizon archives with a low expectation of extensive access, such as government and historical records.
• An example in a scientific context is CERN’s CASTOR system, which stores a total of 80 PB of Large Hadron Collider data and grows at 15 PB per year. Only 10% is maintained on disk. Archives of older data are needed for potential future verification of events, but access rates decrease considerably 2–3 years after collection. Further examples are found in astronomy, medicine and interplanetary exploration.

Conclusion
• Density, stability, and energy efficiency are all potential advantages of DNA storage, although costs and times for writing and reading are currently impractical for all but century-scale archives.
• DNA-based storage remains feasible on scales many orders of magnitude greater than current global data volumes.
• However, the costs of DNA synthesis and sequencing have been dropping at exponential rates of 5- and 12-fold per year, respectively, much faster than electronic media at 1.6-fold per year.
• DNA synthesis costs are dropping at a pace that should make storing data on DNA cost-effective for sub-50-year archiving within a decade.
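
To get a feel for these exponential rates, the sketch below computes how long a 100-fold cost reduction would take if each yearly rate stayed constant. Constant rates are an assumption made for illustration, not a forecast.

```python
import math


def years_to_reduce(fold_reduction, fold_per_year):
    """Years for a cost to fall by `fold_reduction` at a constant yearly rate."""
    return math.log(fold_reduction) / math.log(fold_per_year)


for name, rate in (("synthesis", 5), ("sequencing", 12), ("electronic media", 1.6)):
    print(name, round(years_to_reduce(100, rate), 1))
```

At the quoted rates, synthesis and sequencing would reach a 100-fold reduction in a few years, whereas electronic media would need close to a decade.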

Thank you for listening!