Documenting PGR A short summary about Theories and
Documenting PGR A short summary about Theories and Practical work June 2008 SADC
How do you store your data? • • • In MS Excel spreadsheet? Handwritten cards? In a database (like MS Access)? In you head? In another database format? June 2008 SADC
What is a database? • Databases are designed to offer an organized mechanism for storing, managing and retrieving information. They do so through the use of tables. If you’re familiar with spreadsheets like Microsoft Excel, you’re probably already accustomed to storing data in tabular form. It’s not much of a stretch to make the leap from spreadsheets to databases. • The ability to link tables is a very powerful resource June 2008 SADC
So why not use a spreadsheet? Databases are actually much more powerful than spreadsheets in the way you’re able to manipulate data. Just a few actions you can perform on a database: • • Retrieve all records that match certain criteria Update records in bulk Cross-reference records in different tables Perform complex aggregate calculations June 2008 SADC
What is a database? I have data! In other words: Me too! A well-organized set of interrelated data held in one or more files which are capable of being managed by a software. June 2008 Me also! Me too! SADC
Step by step • Analysis • of gene bank activities to determine information and documentation needs • Design • of the manual and/or computerized system based on documentation and information needs • Implementation • of the system that has been developed June 2008 SADC
Data • Quantitative • Qualitative, description of the object being examined. Blue, horizontal. . Quantitative, dealing with numbers 94 g, 34 mm, 67% June 2008 You can also store pictures, links urladdresses, maps, scanned material. . SADC
A relation can be seen as a table that stores information. Attributes or columns Person Rows Name Filip Address Lund Telephone 12345 Records Peter Malmö 23456 Victoria Malmö 34567 June 2008 SADC
But every table is not an relation. • There can not exist two rows in a table that are exactly the same • Every column in the table has a unique name • There is only one value in every cell. • All values in one column belongs to the same domain or are undefined (null) June 2008 SADC
Redundant data Data in just one table repeat information… Students taking courses in a school Name Address Tel Course Filip Lund 12345 Math Peter Malmö 23456 Math Victoria Malmö 34567 English Peter Malmö 23456 History What if Peter wants to take a course in English? June 2008 SADC
Example: Students taking courses in a University Person_id Name Adress Tel 1 Filip Lund 12345 12 Peter Malmö 23456 123 Victoria Malmö 34567 Course_id Study C 2 Math C 3 Engl C 4 History Person_id Course_id 1 C 2 12 C 4 123 C 3 How do I store information if Peter wants to take a course in English? June 2008 SADC
Primary key • • A primary key can identify the data row It must be unique Cant be null Can be a ID The easiest way is to let the program hold track of the rows and automatically increase the key. Remember the primary key must be unique! June 2008 SADC
Table of accessions, data of seedsample accessi on_num ber NGB 1 NGB 2 NGB 3 NGB 4 NGB 5 Scientific name Triticum aestivum ssp. aestivum genebank _institute_ name origin_c ountry culton _type acqusition _date IDUNA Nordic Gene Bank Sweden CV 20080625 STANDARD Nordic Gene Bank Sweden CV 20080625 JARL Nordic Gene Bank Sweden ANKAR Nordic Gene Bank Denmar k SAXO Nordic Gene Bank accession_name 20080625 CV 20080625 Sweden CV The id’s are the primary key, because they identify the rest of the rows June 2008 SADC
But how store data in a good and reliable way? • Just to put everything in a file, in one big table is not a good idea. • Use a relational database (think of the db as several tables that are connected to each other) • Normalize it! – Split up information into different tables and give the record an identifier! – Save related information in one table. (For instance all personal data in a table called “person”) – Link different tables together so you can retrieve all information you need June 2008 SADC
Normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process: • eliminating redundant data (for example, storing the same data in more than one table) and • ensuring data dependencies make sense (only storing related data in a table). Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored. June 2008 SADC
This database is poorly normalized. In fact, it is not normalized at all. Everything is stored in one big table. Imagine how many things you have to repeat if you test the same June 2008 SADC accession three times!
Person id Name Address Zipcode/city countr Title Institute Ins_street Inst_City Telephone Cou 123 Magdalena Svärdh Bondev 21 227 38 Lund Swe IT staff Nordic Genetic Resource center Smedjev 3 Alnarp +46 40 536640 Swe 124 Louise Bondo Danske vej Farum Den Section leader Nordic Genetic Resource center Smedjev 3 Alnarp +46 40 536640 Swe 125 Pia Ohlsson Blentarp 123 45 Swe Lab Nordic Genetic Resource center Smedjev 3 Alnarp +46 40 536640 Swe 126 Johan Bäckman Dag H. vg 222 54 Lund Swe IT boss Nordic Genetic Resource center Smedjev 3 Alnarp +46 40 536640 Swe 127 Jonas Nordling S Hans gr 124 53 Lund Swe IT staff Nordic Genetic Resource center Smedjev 3 Alnarp +46 40 536640 Swe 128 Slatan Ibrahim Long Street 02 Copenhagen Den Apple expert Pometet LIFE Institute for Agricultural Science Short Street 2630 Taastrup +45 35 28 4500 DNK 129 Person A street 123 45 Malmö swe Seed store Nordic Genetic Resource center Smedjev 3 Alnarp +46 40 536640 Swe 130 Svensson B street 123 45 Malmö swe Seed store Nordic Genetic Resource center Smedjev 3 Alnarp +46 40 536640 Swe 131 Karl Krlssn C street 123 45 Malmö swe Seed store Nordic Genetic Resource center Smedjev 3 Alnarp +46 40 536640 Swe Information about the institute where people work is repeated several times! Imagine how boring it will be to add the same information for all staff members in Nord. Gen! June 2008 SADC
Person id Name Address Zipcode/city countr Title Inst_id 123 Magdalena Svärdh Bondev 21 227 38 Lund Swe IT staff 45 124 Louise Bondo Danske vej Farum Den Section leader 45 125 Pia Ohlsson Blentarp 123 45 Swe Lab 45 126 Johan Bäckman Dag H. vg 222 54 Lund Swe IT boss 45 127 Jonas Nordling S Hans gr 124 53 Lund Swe IT staff 45 128 Slatan Ibrahim Long Street 02 Copenhagen Den Apple expert 46 129 Person A street 123 45 Malmö swe Seed store 45 130 Svensson B street 123 45 Malmö swe Seed store 45 131 Karl Krlssn C street 123 45 Malmö swe Seed store 45 Institutes Inst_id Institute Ins_street Inst_City Telephone Cou 45 Nordic Genetic Resource center Smedjev 3 Alnarp +46 40 536640 Swe 46 Pometet LIFE Institute for Agricultural Science Short Street 2630 Taastrup +45 35 28 4500 DNK June 2008 SADC Much better! If the Institute changes telephone number you just have to change it once!
Relational Database Managers By linking files or information together, a relationship is produced between the tables. The shared field is only stored once, it is not duplicated in each file June 2008 SADC
You often visualize a table like this: person id name address zipcode/city countr title inst_id 123 Magdalena Svärdh Bondev 21 227 38 Lund Swe IT staff 45 124 Louise Bondo Danske vej Farum Den Section leader 45 125 Pia Ohlsson Blentarp 123 45 Swe Lab 45 126 Johan Bäckman Dag H. vg 222 54 Lund Swe IT boss 45 127 Jonas Nordling S Hans gr 124 53 Lund Swe IT staff 45 128 Slatan Ibrahim Long Street 02 Copenhagen Den Apple expert 46 129 Person A street 123 45 Malmö swe Seed store 45 130 Svensson B street 123 45 Malmö swe Seed store 45 131 Karl Krlssn C street 123 45 Malmö swe Seed store 45 June 2008 SADC person id name adress zipcode/city country title inst_id
another example… Fields which have a logical relation to Seed info each other and to the identifying Accession_id field, should be grouped together in a Accession name relation. Species Date of acquisition This is not a good example! Donor. Institute. Name Street Address City Address Other fields… If the gene bank receives another accession at a later date , the entire address would need to be entered again in a different record. It would be better to store information about the donors in a separate table. June 2008 SADC
… this would be better Donor Seed info Accession_id Acc. Don_id Accession name don. Per_id Species don. Inst_id Date of acquisition don. Accname Amount of seeds Don_info Acc. Don_id Other fields… June 2008 Information about the donated material, from which institute, person who donated it, what the accession has been called in the donating institute and other information about the donated material SADC
…yet another example Storage Registration Passport Accession name Scientific name Collecting institute Freezer_number Original country Collector Box_number Other fields… Date of collection Storage_date Other fields… What happens if you have to change the name? You have to change it in all three otherwise data will be lost. The best thing is to store a link to the accession and store all the information about the accession elsewhere. June 2008 SADC
Accession Passport Storage Accession_id Accession name Collecting institute Freezer_number Scientific name Collector Box_number Original country Date of collection Storage_date Other fields… If you misspelled the accession name, you just have to change it once June 2008 SADC
Visualized in a ER-model (Entity-Relationship model) it looks like this: Gene Bank 1: N Seed sending info Our institute (Gene Bank) can receive many seeds (with seed sending information) June 2008 SADC
Visualized in a ER-model (Entity-Relationship model) it looks like this: species 1 1 teacher * accessions 0. . 5 students One species can have many accessions. One accession belongs to just one species One teacher can have up to five students to teach June 2008 SADC
Different categories for data modeling Object oriented model Relational model June 2008 SADC
Data modeling- how to get started • To start – find the object • Let the object become a relation • Each object or relation should have its own information and descriptive attributes • The object becomes the table, the attributes columns and the identifying attribute becomes the primary key • An object could be – Persons – Places – Objects (like Accessions, institutes…) – Events (like seed drying, germination tests. . ) Things you want to store information about June 2008 SADC
Data modeling • A data model makes it easier to understand the meaning of the data • A common and popular data model used in database design is based on the concepts of the Entity – Relationship (ER) model • Conceptual – logical structure on databases. Designed by experts, information users, those who know and are familiar with the information. Identification of important entity and relationship types • Physical structure – designed by experts, database developers. How the logical structure is to be physically implemented (as tables) on the target DBMS June 2008 SADC
Data modeling: • Should represent reality and the information that need to be stored • Help us to visualize our task and clarify rules and restrictions and relations between objects • Gives the staff a chance to participate early in the development • Is a established way of developing information systems • Is a way to document and explain the IT-system June 2008 SADC
Accession and collecting information Start to gather related information into tables and link them together • • • Accession number Scientific name Pedigree name Donor name Acquisition name And more… June 2008 • Collecting organization • Collecting date • Country of collection • And more… SADC
A model of accession descriptors and collecting descriptors: Accession Collecting Accession_id Accession name Collection organization Scientific name Collecting date Cultivar name Country of collecting Donor Province/state Acquisition date Location of collecting site More… Type of sample… The accession relation gather all the information about the accession. The collecting relation describe all the information about the specific collecting event June 2008 SADC
Example of relations Collect 1 Accession Taxon taxon_id sci_name eng_name mandate grin_no thgw 1 N 1 1 accession_id taxon_id accession_name country_id acc_mandate N collect_id accession_id person_id latitude longitude site_name country_id Seedstore batch_id accession_id batch_no collection box_no harvest_year 1 germin_test N June 2008 SADC test_id accession_id batch_id grm_pct test_date person_id N
Wow Instead of repeating a lot of information, You store the accession once and link to other tables with other information! By linking files or information together, a relationship is produced between the tables. Smart! And effective! June 2008 SADC
Part 2 Maybe a short brake? June 2008 SADC
DBMS A software system that enables users to define, create, maintain and control access to the database. • The DBMS is the software that interacts with the users’ application programs and the database. • Typically a DBMS provides the following facilities… June 2008 SADC
• It allows users to define the database with types and structures and the constraints on the data to be stored in the database. (DDL) • It allows users to insert, update, delete and retrieve data from the database. (DML) • It provides controlled access to the database. June 2008 SADC
Example of DML and DDL • Data manipulating language (DML) – Select * from Person; – Insert into Person (Name, Address, tel) values (‘John Book’, ‘California’, 123 45) • Data Definition language (DDL) (Changes in structure) – Create table Person ( Name varchar(40) Not. Null, Address char(10), tel int) June 2008 SADC
DBMS environment Data Procedures People Hardware Software Bridge Human Machine June 2008 SADC
Users with software and questions User 1 Web site, For instance SESTO. Question: List all accessions User 2 DBMS Interpretation of question Data base language (SQL) DDL & DML June 2008 SADC Data bases with information and meta data
Features in SESTO e. g. project archive and pictures archive Entry levels with seed store data: Genus, Taxons, Cultivars and Accessions. To list all Taxons in your gene bank, click the button and information about the taxons and their accessions will be presented. June 2008 SADC
To see the accession list of this taxon (Arachis hypogaea) click the [select]-link June 2008 SADC
To see detailed information about this accession click [select] June 2008 SADC
The accession name SWZ 1 from Zea mays gives this information To see all the information from stored material click this link June 2008 SADC
To edit accession information and storage information, click the [edit] links and a pop up window will be presented Two batches with 10 distribution bags are stored in the Active Storage in freezer 15 in boxes 10 and 8. June 2008 SADC
To add and edit information about an accession. The information goes to the database and will be stored there June 2008 SADC
A documentation system should be… • Reliable • Retrieve information fast • User-friendly • Flexible, should anticipate changes June 2008 SADC
Things you will not regret! • Be organized and structured from the start! • Be sure you have a good data model Spend some time on normalization, it will be worth it later June 2008 • Try to cooperate and work together with the documentation working group • (Documentation is far to important to be left alone with the IT-staff) SADC
Web services • Web services are standardized programs that send your data to databases automatically. • Your data will be published elsewhere • This makes your work recognized • And will give your gene bank credit and your material will be asked for • GBIF, Biocase Have you heard about the new dataset… June 2008 SADC
Last but not least… Data is not information. • Information must be retrieved and understood. • Then knowledge is created. "Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information? " T. S. Eliot (Where is the information we have lost in data? ) June 2008 SADC
Thank you for listening! June 2008 SADC
- Slides: 51