DDS A Seismic Processing Architecture Reproducible research workshop
DDS, A Seismic Processing Architecture Reproducible research workshop UBC, Vancouver, 2006 Randall L. Selzler Jerry Ehlers Joseph A. Dellinger* RSelzler @ Data-Warp. com Jerry. Ehlers @ BP. com Joseph. Dellinger @ BP. com 1
DDS ORIGINS: Amoco TRC, early 90’s DDS began at the Amoco Tulsa Research Center at a time of great organizational strain. The job of the TRC was to do research and crunch data, not to write software. Creating software is expensive! Amoco’s solution was an edict that “everyone will use DISCO, or else”. 2
Else! But DISCO just wasn’t good enough! And so chaos ensued. . . We were “mired in seismic processing diversity”. DDS grew up surrounded by: • • USP (Amoco internal trace-header based) SEPlib (ASCII header pointing to data cubes) SU (SEGY trace-header based) DISCO (proprietary monitor-based system) . . and needed to be compatible with all of these! 3
Although formally cast as a research group, in fact the TRC also functioned as an “internal contractor” processing shop. 1) So to catch on, not only would any software have to be usable for quick-turnaround research, but 2) the ability to process large datasets efficiently and in parallel was also of vital importance. [Terabytes of data, Connection Machines, MPI, Open. MP] 3) The group had accumulated a considerable number and variety of computers. [All “Unix”, but CM 5, Cray, Sun, SGI, Linux clusters, 32 and 64 bit. . . ] 4) Finally, there was an urgent need for software that could accomodate all the various mutant SEGY formats coming into the shop, as well as DISCO, SEPlib, SU, and USP! 4
and out of the chaos came. . . John Etgen was using SEPlib for migration algorithm research on the CM 200, a machine that required massively parallel data I/O. He showed SEPlib to Randy Selzler: “I want something that looks like THIS, but can handle the large industrial-strength jobs I need to do!” And thus DDS was born. . . 5
How SEPlib did it “header” file data file . . . processing history. . . esize=4 (bytes) data_format=xdr_float in=data_location n 1=trace_length n 2=number_traces_per_record n 3=number_records regularly sampled cube of IEEE 4 -byte floats of dimension n 1 x n 2 x n 3 d 1=sample_interval o 1=starting sample etc. . . SEPlib was the system favored by the folks writing programs that worked on large data volumes instead of individual traces. 6
DDS can look a lot like SEPlib header file DDS “dictionary” file . . . processing history. . . esize=4 (bytes) data_format=xdr_float type=float 4 format=fcube in=data_location data= data location n 1=trace_length n 2=number_traces_per_record n 3=number_records axis= t offset cdp size. t = trace length size. offset=number traces per record size. cdp= number records d 1=sample_interval o 1=starting sample label 1=seconds etc. . . delta. t= sample_interval origin. t= starting sample units. t= seconds etc. . . 7
DDS can look a lot like SEPlib “dictionary” file data file type=float 4 format=fcube data= data location axis= t offset cdp size. t = trace length size. offset=number traces per record size. cdp= number records delta. t= sample_interval origin. t= starting sample units. t= seconds etc. . . regularly sampled cube of IEEE 4 -byte floats of dimension size. t x size. offset x size. cdp (command-line arguments look a LOT like SEPlib too) 8
DDS’s Generalizations Dictionary … axis= t y cmp … size. t= 1000 size. y= 96 size. cmp= 24 … delta. t= 0. 008 units. t= s … origin. y= 5000 units. y= m … format= segy data= oak 39_@ • N-Dimensional Array of I/O Records • Densely populated for random access • Sequential access if sparse • Meaningful Axis Names • t, x, y, z, w, kx, ky, kz, cmp, shot, offset, … • Extensible Axis Attributes • Regular grid (size, origin, delta, units, …) • Variable grid (grid. z= 1 3 5 7 11, …) • Non-numeric (label. attr= Vp Vs rho) Binary Data Card Header Line Header Great for research! Exotic algorithms and unforeseen domains can be accurately represented and processed as easily as traditional ones. Traces… 9
How USP did it USP-format data file Unix Seismic Processing historical line header (processing history and 3 data dimensions) USP was Amoco’s internally home-grown trace-based processing system, beloved of Amoco’s signal processors. element count trace header trace samples traces USP is similar to SU in concept. USP uses longer trace headers than SU, but they still turned out to not be long enough! . . . USP is still used as much as ever today. 10
SU and USP use fixed-format trace headers defined by include files /* * hdr. h – SU include file for segy offset array */ static struct { char *key; char *type; int offs; } hdr[] = { { "tracl", "i", 0}, { "tracr", "i", 4}, { "fldr", "i", 8}, { "tracf", "i", 12}, { "ep", "i", 16}, { "cdp", "i", 20}, { "cdpt", "i", 24}, { "trid", "h", 28}, { "nvs", "h", 30}, { "nhs", "h", 32}, { "duse", "h", 34}, { "offset", "i", 36}, { "gelev", "i", 40}, { "selev", "i", 44}, { "sdepth", "i", 48}, { "gdel", "i", 52}, {. . . 11
DDS also plays well with USP DDS dictionary file USP-format data file type=float 4 format=usp data= data location axis= t offset cdp comp size. t = trace length size. offset=number traces per record size. cdp= number records size. comp= number components delta. t= sample_interval origin. t= starting sample units. t= seconds etc. . . element count trace header trace samples traces element count trace header trace samples . . . DDS knows what USP headers look like! line header (three dimensions) 12
and SEGY. . . DDS dictionary file type=float 4 ibm EBCDIC cards binary header format=segy data= data location axis= t offset cdp comp size. t = trace length size. offset=number traces per record size. cdp= number records size. comp= number components trace header IBM-format samples . . . delta. t= sample_interval origin. t= starting sample units. t= seconds etc. . . SEGY-format data file Note DDS only bothers to convert back to SEGY’s archaic IBM floats when writing to disk! 13
DDS can speak SU note input format auto-detected editd in=minute 2. usp 3 s=16 3 e=16 2 s=2 2 e=32 2 i=2 out_format= su out_data= stdout: | supswigp clip=. 2 > wiggle. ps 14
DDS dictionaries can point at dictionaries! dict. comp 1 type=float 4 ibm format=segy slice. comp data= dict. comp 1 dict. comp 2 dict. comp 3 axis= t offset cdp comp size. t = trace length size. offset=number traces per record size. cdp= number records size. comp= number components format=segy data= data. c 1. segy SEGY binary data file data. c 1. segy axis= t offset cdp size. t = trace length size. offset=number traces per record size. cdp= number records. . . dict. comp 2 type=float 4 ibm format=segy . . . data= dict. c 2. segy axis= t offset cdp SEGY binary data file 15 data. c 2. segy
DDS plays well with mutant SEGY bridge in= Atlantis_EQ. segy in_format=segy out_format=usp comment="Component Type" straight map: segy: usp. Rc. Comp= "Total. Static" comment="Src and rec locations" map: segy: usp. Sr. Pt. XC= "Src. X / 10" map: segy: usp. Sr. Pt. YC= "Src. Y / 10" map: segy: usp. Sr. Pt. El= "15" fixed number map: segy: usp. Sht. Dep= "Src. Depth / 10" map: segy: usp. Rc. Pt. XC= "Grp. X / 10" map: segy: usp. Rc. Pt. YC= "Grp. Y / 10" map: segy: usp. Grp. Elv= "Spare. I 4[10] / 10" map: segy: usp. Cab. Dep= "Spare. I 4[10]" map: segy: usp. Dst. Sgn= "Dst. Sgn / 10" arithmetic calculation comment="Rec point and line numbers" map: segy: usp. Dp. Pt. Ln= "Spare. I 4[8]" map: segy: usp. Dp. Pt. Lt= "Spare. I 4[9]" comment="Dead or Live" map: segy: usp. Sta. Cor= '( Trc. Id. Code - 1 ) * 30000' | editd in= stdin: 3 e=106 out_data= raw. usp 16
Data formats and mappings • This is how DDS differs from SEPlib. . . The properties of the binary data, and all the elements within the binary data, are looked up in the “dictionary”. • Even the array of trace samples is just another trace field as far as DDS is concerned. • DDS knows a few default formats, but can use any format that you can define. • It can also map to and from any format that you can define the necessary mappings for. • This has the important side effect of documenting the data format, making future reproducibility possible 17
DDS supports generic formats In fact, besides having a few built-in default formats such as USP, SU, and SEGY that are convenient for geophysicists, there is nothing in the core of DDS that limits it to being a seismic processing system! 18
Internal data formats • Programs can define their own internal data formats as well, simply by writing definitions into their own internal dictionary: fdds_printf (‘MOD_FIELD’, ‘ *+ float My. Header 1, My. Header 2; n ’) • DDS will then convert from the format of the data, as documented by its dictionary, to the internal format specified by the program. • On output, the internal format will be converted back into whatever output format has been requested on the command line, or by default, the output format will be the same as the input format. 19
Leverage Diversity? Interoperate! Data handling is fundamental… Non-DDS Application Format and API Emulation With Random Access I/O USP Re-link 1998 Proof of Concept Disk File Pipe/Socket Tape DISCO Support 1997 -2003 Generic Read DDS Application Generic I/O Non-DDS Application API Emulation Foreign Library API Emulation Generic I/O Foreign Format Any DDS Supported Format Generic Write Disk File Pipe/Socket Tape Non-DDS Application 20
Are you scared yet? • You can probably imagine that all this translating between formats can get very complicated. . . fmt: SAMPLE_TYPE= typedef float 4 SAMPLE_TYPE; fmt: USP_ADJUST= typedef enum 4 {USP_LINE_PAD = 0, USP_TRACE_PAD = 0, USP_HLH_SIZE = 2236} USP_ADJUST; fmt: SEQUENCE= typedef USP_TRACE SEQUENCE; alias: fmt: USP_TRACE_PAD= fmt: USP_ADJUST alias: fmt: USP_HLH_SIZE= fmt: USP_ADJUST alias: fmt: USP_LINE_PAD= fmt: USP_ADJUST usp_Num. Rec= 2056. . . But still better than having to change your code or relink your code for every different mutant data format! It also makes it possible to 21 interoperate with historical data formats without too much pain.
DDS scripting as a Rosetta stone /apps/global/bin/bridge in= /hpc/dat 13/zdsr 01/Node/EQ/all. segy in_format=segy out_format=usp comment="Component Type" map: segy: usp. Rc. Comp= "Total. Static" comment="Src and rec locations" map: segy: usp. Sr. Pt. XC= "Src. X / 10" map: segy: usp. Sr. Pt. YC= "Src. Y / 10" map: segy: usp. Sr. Pt. El= "15" map: segy: usp. Sht. Dep= "Src. Depth / 10" comment="Azimuth, Roll Tilt" map: segy: usp. TVPT 01= "100 * Spare. F 4[11]" map: segy: usp. TVPT 02= "100 * Spare. F 4[12]" map: segy: usp. TVPT 03= "100 * Spare. F 4[13]" comment="Dead or Live" map: segy: usp. Sta. Cor= '( Trc. Id. Code - 1 ) * 30000' comment="Shot Time" map: segy: usp. TVPT 15=Date. Year map: segy: usp. TVPT 16=Date. Day map: segy: usp. TVPT 17=Date. Hour map: segy: usp. TVPT 18=Date. Min map: segy: usp. TVPT 19=Date. Sec . . 22
In Conclusion: caveats • Things aren’t so complicated if you use DDS as if it were SEPlib, but then what’s the point? • Because so much functionality already exists in USP, there has been little motivation to flesh out DDS. • The external distribution is a subset of the same stuff we use internally. There has been little effort put into improving the “packaging”. • While there is some documentation, it is somewhat lacking! 23
In Conclusion: upsides • The software infrastructure inside BP today is based almost entirely on DDS and USP. It is BP’s infrastructure both for research and for processing. BP’s advanced imaging team in Houston is “BP’s largest contractor”. • The DDS I/O library was released publicly in 2003 on “freeusp. org”. The core of the USP system was released a year or so earlier on the same web site, along with some ARCO-heritage processing systems as well. • By releasing USP and DDS, BP hoped to make it easier to share algorithms with academia and contractors. • Randy Selzler now wants to create a successor to DDS, but that’s his talk, as the “prophet”, to give. . . 24
- Slides: 24