CPSC 335 Compression and Huffman Coding Dr. Marina Gavrilova Computer Science University of Calgary Canada

The goal of this lecture is to provide an overview of security concepts in modern software systems.

Futurama reference
The television program Futurama contained a substitution cipher, called "Alien Language", in which all 26 letters were replaced by symbols. Die-hard viewers deciphered it rather quickly because a "Slurm" ad showed the word "Drink" in both plain English and the alien language, thus giving away the key. Later, the producers created a second alien language that used a combination of replacement and mathematical ciphers. These messages can be seen throughout the series and the subsequent movies.

Security in literature
Sherlock Holmes breaks a substitution cipher in "The Adventure of the Dancing Men".
The Al Bhed language in Final Fantasy X is actually a substitution cipher, although it is pronounced phonetically.
The language in Star Fox Adventures: Dinosaur Planet is also a substitution cipher of the English alphabet.
In the Artemis Fowl series by Eoin Colfer there are three substitution ciphers (Gnommish, Centaurean and Eternean), which run along the bottom of the pages or appear elsewhere within the books.

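History: Caesar's cipher
A message of
flee at once. we are discovered!
enciphers to
SIAA ZQ LKBA. VA ZOA RFPBLUAOAR!
This particular ciphertext is produced by a keyword-mixed substitution alphabet (it matches the alphabet built from the keyword "zebras"), not by a fixed shift. A Caesar cipher proper shifts every letter by the same number of positions.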
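
The substitution above can be reproduced with a short sketch. The keyword "zebras" and the function names are assumptions made for illustration (the slide does not state how its alphabet was built); the keyword letters come first, followed by the unused letters of the alphabet in order.

```python
import string

def keyword_alphabet(keyword):
    """Build a mixed cipher alphabet: keyword letters first, then the rest."""
    seen = []
    for ch in keyword.upper() + string.ascii_uppercase:
        if ch in string.ascii_uppercase and ch not in seen:
            seen.append(ch)
    return "".join(seen)

def encipher(plaintext, keyword):
    # Map A..Z onto the mixed alphabet; punctuation and spaces pass through.
    table = str.maketrans(string.ascii_uppercase, keyword_alphabet(keyword))
    return plaintext.upper().translate(table)

def decipher(ciphertext, keyword):
    # The inverse mapping: mixed alphabet back onto A..Z.
    table = str.maketrans(keyword_alphabet(keyword), string.ascii_uppercase)
    return ciphertext.upper().translate(table)

print(encipher("flee at once. we are discovered!", "zebras"))
# prints: SIAA ZQ LKBA. VA ZOA RFPBLUAOAR!
```

Because each plaintext letter always maps to the same ciphertext letter, letter frequencies survive encryption, which is why such ciphers are easy to break.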

Visual ciphers

Encrypting devices
An Enigma machine is used for the encryption and decryption of secret messages. Enigma was invented by the German engineer Arthur Scherbius at the end of World War I. The early models were used commercially from the early 1920s, and were adopted by the military and government services of several countries. In December 1932, the Polish Cipher Bureau broke Germany's military Enigma ciphers. Five weeks before the outbreak of World War II, on 25 July 1939, in Warsaw, they presented their Enigma-decryption techniques and equipment to French and British military intelligence. During the war, British codebreakers were able to decrypt a vast number of messages that had been enciphered using the Enigma. The intelligence gleaned from this source, codenamed "Ultra" by the British, was a substantial aid to the Allied war effort [Wikipedia].

Vigenère cipher
Plaintext: ATTACKATDAWN
Key: LEMONLEMONLE (the keyword LEMON repeated to the length of the plaintext)
Ciphertext: LXFOPVEFRNHR
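
A minimal sketch of the cipher above, assuming the text and keyword contain only the uppercase letters A to Z (the function name and the `decrypt` flag are illustrative). Each plaintext letter is shifted by the alphabet position of the matching keyword letter, with the keyword repeated as needed.

```python
def vigenere(text, keyword, decrypt=False):
    """Shift each letter of `text` by the corresponding keyword letter."""
    out = []
    for i, ch in enumerate(text):
        shift = ord(keyword[i % len(keyword)]) - ord("A")
        if decrypt:
            shift = -shift  # undo the shift applied during encryption
        out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
    return "".join(out)

print(vigenere("ATTACKATDAWN", "LEMON"))                 # LXFOPVEFRNHR
print(vigenere("LXFOPVEFRNHR", "LEMON", decrypt=True))   # ATTACKATDAWN
```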

Encryption
“Masks” data for secure transmission or storage.
Encrypt(data, encryption key) = encrypted data
Decrypt(encrypted data, decryption key) = original data
Without the decryption key, the encrypted data is meaningless gibberish.
Symmetric Encryption: Encryption key = decryption key; all authorized users know the decryption key (a weakness). DES, used since 1977, has a 56-bit key; AES has a 128-bit (optionally 192-bit or 256-bit) key.
Public-Key Encryption: Each user has two keys.
User’s public encryption key: Known to all.
Decryption key: Known only to this user.
Used in the RSA scheme (Turing Award!).

RSA is an algorithm for public-key cryptography that is based on the presumed difficulty of factoring large integers. RSA stands for Ron Rivest, Adi Shamir and Leonard Adleman, who first publicly described it in 1977. Clifford Cocks, an English mathematician, had developed an equivalent system in 1973, but it was classified until 1997. A user of RSA creates and then publishes the product of two large prime numbers, along with an auxiliary value, as their public key. The prime factors must be kept secret. Anyone can use the public key to encrypt a message, but with currently published methods, if the public key is large enough, only someone with knowledge of the prime factors can feasibly decode the message.

RSA Public-Key Encryption
Let the data be an integer $I$.
Choose a large ($\gg I$) integer $L = p \cdot q$, where $p$ and $q$ are large (say 1024-bit), distinct prime numbers.
Encryption: Choose a random number $1 < e < L$ that is relatively prime to $(p-1)(q-1)$. The encrypted data is $S = I^e \bmod L$.
Decryption key $d$: Chosen so that $d \cdot e \equiv 1 \pmod{(p-1)(q-1)}$. One can then show that $I = S^d \bmod L$.
It turns out that the roles of $e$ and $d$ can be reversed, so they are simply called the public and private keys.
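
The scheme can be traced with toy numbers. This is a sketch only: the primes 61 and 53, the exponent 17, and the helper names are assumptions chosen for readability, and keys this small offer no security. The modular inverse via `pow(e, -1, phi)` needs Python 3.8 or later.

```python
from math import gcd

def make_keys(p, q, e):
    """Return (public, private) keys as (exponent, modulus) pairs."""
    L = p * q
    phi = (p - 1) * (q - 1)
    if gcd(e, phi) != 1:
        raise ValueError("e must be relatively prime to (p-1)*(q-1)")
    d = pow(e, -1, phi)  # d * e = 1 mod (p-1)*(q-1)
    return (e, L), (d, L)

def apply_key(value, key):
    """Raise value to the key's exponent modulo the key's modulus."""
    exponent, modulus = key
    return pow(value, exponent, modulus)

public, private = make_keys(61, 53, 17)
I = 65                        # the data, an integer smaller than L
S = apply_key(I, public)      # encryption: S = I^e mod L
recovered = apply_key(S, private)  # decryption: I = S^d mod L
print(recovered == I)         # True
```

The same `apply_key` function serves for both directions, which reflects the interchangeability of $e$ and $d$ noted above.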

Bob-Alice scheme

Internet-Oriented Security
Key Issues: User authentication and trust.
When a DB must be accessed from a secure location, password-based schemes are usually adequate.
For access over an external network, trust is hard to achieve.
If someone with Sam’s credit card wants to buy from you, how can you be sure it is not someone who stole his card?
How can Sam be sure that the screen for entering his credit card information is indeed yours, and not some rogue site spoofing you (to steal such information)?
How can he be sure that sensitive information is not “sniffed” while it is being sent over the network to you?
Encryption is a technique used to address these issues.

Certifying Servers: SSL, SET
If Amazon distributes their public key, Sam’s browser will encrypt his order using it. So, only Amazon can decipher the order, since no one else has Amazon’s private key.
But how can Sam (or his browser) know that the public key for Amazon is genuine? The SSL protocol covers this.
Amazon contracts with, say, Verisign, to issue a certificate <Verisign, Amazon, amazon.com, public-key>.
This certificate is stored in encrypted form, encrypted with Verisign’s private key, known only to Verisign.
Verisign’s public key is known to all browsers, which can therefore decrypt the certificate, obtain Amazon’s public key, and be confident that it is genuine.
The browser then generates a temporary session key, encodes it using Amazon’s public key, and sends it to Amazon.
All subsequent messages between the browser and Amazon are encoded using symmetric encryption (e.g., DES), which is more efficient than public-key encryption.
What if Sam doesn’t trust Amazon with his credit card information? Secure Electronic Transaction protocol: 3-way communication between Amazon, Sam, and a trusted server, e.g., Visa.

Authenticating Users
Amazon can simply use password authentication, i.e., ask Sam to log into his Amazon account.
This is done after SSL is used to establish a session key, so that the transmission of the password is secure!
Amazon is still at risk if Sam’s card is stolen and his password is hacked.
Digital signatures: Sam encrypts the order using his private key, then encrypts the result using Amazon’s public key.
Amazon decrypts the message with their private key, and then decrypts the result using Sam’s public key, which yields the original order!
This exploits the interchangeability of public/private keys for encryption/decryption.
Now, no one can forge Sam’s order, and Sam cannot claim that someone else forged the order.
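
The sign-then-encrypt exchange can be sketched with toy RSA keys. Everything here is an illustrative assumption: the tiny primes, the exponents, the function names, and the representation of the order as a small integer. Sam's modulus is deliberately smaller than Amazon's so that the signed value fits inside the second encryption.

```python
def make_keys(p, q, e):
    """Return (public, private) toy RSA keys as (exponent, modulus) pairs."""
    phi = (p - 1) * (q - 1)
    return (e, p * q), (pow(e, -1, phi), p * q)  # needs Python 3.8+

def apply_key(value, key):
    exponent, modulus = key
    return pow(value, exponent, modulus)

sam_public, sam_private = make_keys(61, 53, 17)       # modulus 3233
amazon_public, amazon_private = make_keys(89, 97, 5)  # modulus 8633

def sam_sends(order):
    signed = apply_key(order, sam_private)   # only Sam can produce this
    return apply_key(signed, amazon_public)  # only Amazon can read it

def amazon_receives(message):
    signed = apply_key(message, amazon_private)
    return apply_key(signed, sam_public)     # succeeds only for Sam's signature

order = 1234  # must be smaller than Sam's modulus
print(amazon_receives(sam_sends(order)) == order)  # True
```

Real systems sign a hash of the message and use padding rather than raw exponentiation; the sketch only shows the order in which the four keys are applied.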

Digital signature
http://www.youdzone.com/signature.html

Cybercrimes
http://www.pcmag.com/article2/0,2817,2330369,00.asp
http://securitywatch.pcmag.com/none/283622-mcafee-s-decade-ofcybercrime
http://newmediarockstars.com/2012/09/gangnam-style-youtubemisplaces-its-most-liked-video/

Introduction to DB Security
Secrecy: Users should not be able to see things they are not supposed to. E.g., a student can’t see other students’ grades.
Integrity: Users should not be able to modify things they are not supposed to. E.g., only instructors can assign grades.
Availability: Users should be able to see and modify things they are allowed to.

Access Controls
A security policy specifies who is authorized to do what.
A security mechanism allows us to enforce a chosen security policy.
Two main mechanisms at the DBMS level:
Discretionary access control
Mandatory access control

Introduction to Coding
Huffman Coding
Non-determinism of the algorithm
Implementations:
Singly-linked list
Doubly-linked list
Recursive top-down
Using heap
Adaptive Huffman coding

Huffman Coding
The algorithm is used to assign a codeword to each character in the text according to its frequency. A codeword is usually represented as a bitstring.
The algorithm starts with a set of individual trees, each consisting of a single node, sorted in order of increasing character probability.
Then the two trees with the smallest probabilities are selected and made the left and right subtrees of a new parent node, whose probability is the sum of theirs. This step is repeated until a single tree remains.
In the end, 0 is assigned to every left branch of the tree and 1 to every right branch, and the codeword for each leaf (character) is read off along the path from the root.
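
The steps above can be sketched with a plain sorted Python list standing in for the set of trees. The function name, the tuple representation of trees, and the sample frequencies are assumptions for illustration. A leaf is stored as a 1-tuple `(symbol,)` and an internal node as a pair `(left, right)`.

```python
def huffman_codes(freqs):
    """Build Huffman codewords from a {symbol: frequency} mapping."""
    # List of (weight, tree) pairs kept in order of increasing weight.
    trees = sorted(((w, (sym,)) for sym, w in freqs.items()), key=lambda t: t[0])
    if not trees:
        return {}
    while len(trees) > 1:
        (w1, left), (w2, right) = trees[0], trees[1]  # two smallest trees
        merged = (w1 + w2, (left, right))
        trees = trees[2:]
        # Re-insert the merged tree so the list stays sorted.
        pos = 0
        while pos < len(trees) and trees[pos][0] <= merged[0]:
            pos += 1
        trees.insert(pos, merged)

    codes = {}
    def walk(node, prefix):
        if len(node) == 1:                 # leaf: record its codeword
            codes[node[0]] = prefix or "0"  # lone symbol still gets one bit
        else:
            walk(node[0], prefix + "0")    # left branch is 0
            walk(node[1], prefix + "1")    # right branch is 1
    walk(trees[0][1], "")
    return codes

freqs = {"a": 5, "b": 9, "c": 12, "d": 13, "e": 16, "f": 45}
codes = huffman_codes(freqs)
print(sum(freqs[s] * len(codes[s]) for s in freqs))  # 224 bits in total
```

The exact bit patterns depend on how ties and left/right placement are resolved, but the total encoded length is the same for every valid Huffman tree.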

Non-determinism of the Huffman Coding

Huffman Algorithm Implementation – Linked List
The implementation depends on how the priority queue is represented. The queue must support removing the two smallest probabilities and inserting the new probability in the proper position.
The first way to implement the priority queue is a singly linked list of references to trees, which resembles the algorithm presented in the previous slides.
The tree with the smallest probability is replaced by the newly created tree.
Among trees with the same probability, the first trees encountered are chosen.

Doubly Linked List
All probability nodes are first ordered, and the first two trees are always removed.
The new tree is inserted so that the list stays sorted, with the search for its position starting from the end of the list.
A doubly linked list of references to trees, with immediate access to both the beginning and the end of the list, is used.

Doubly Linked-List implementation

Recursive Implementation
A top-down approach builds the tree starting from the highest probability.
The root probability is known once the lower probabilities in the root’s children have been determined; those in turn are known once the probabilities below them have been computed, and so on.
Thus, a recursive algorithm can be used.

Implementation using Heap
A min-heap of probabilities is built, so the smallest probability sits in the root.
The smallest probability is removed, an element from the bottom of the heap is put in the root, and the heap property is restored.
The root now holds the second-smallest probability; it is set to the sum of the two smallest probabilities, and the heap property is restored again.
The processing is complete when there is only one node left in the heap.
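
A minimal sketch of the heap-based variant using Python's standard `heapq` module. The function name and the choice to return only codeword lengths (rather than the tree) are assumptions made to keep the example short. `heapq.heappop` removes the smallest entry, and `heapq.heapreplace` overwrites the root with the combined entry and restores the heap property in one step.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {symbol: codeword length} for a {symbol: frequency} mapping."""
    tie = count()  # tie-breaker so heapq never has to compare symbol lists
    heap = [(w, next(tie), [sym]) for sym, w in freqs.items()]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in freqs}
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)  # smallest probability
        w2, _, syms2 = heap[0]              # second smallest, now at the root
        for sym in syms1 + syms2:
            lengths[sym] += 1               # these symbols sink one level
        # Replace the root by the sum of the two smallest and re-heapify.
        heapq.heapreplace(heap, (w1 + w2, next(tie), syms1 + syms2))
    return lengths

print(huffman_code_lengths({"a": 5, "b": 9, "c": 12, "d": 13, "e": 16, "f": 45}))
# {'a': 4, 'b': 4, 'c': 3, 'd': 3, 'e': 3, 'f': 1}
```

Each merge costs O(log n), so the whole construction runs in O(n log n), compared with O(n) per insertion for the sorted-list versions. With a single input symbol the loop never runs and the reported length is 0.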

Huffman implementation with a heap

Huffman Coding for pairs of characters

Adaptive Huffman Coding
Devised by Robert Gallager and improved by Donald Knuth.
The algorithm is based on the sibling property: if each node (except the root) has a sibling, and a breadth-first right-to-left tree traversal generates a list of nodes with non-increasing frequency counters, then the tree is a Huffman tree.
In adaptive Huffman coding, the tree includes a counter for each symbol, which is updated every time the corresponding symbol is coded.
Checking whether the sibling property holds ensures that the tree under construction is a Huffman tree.
If the sibling property is violated, the tree is restructured to restore it.
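
The sibling property as stated above can be checked with a short sketch. The `Node` class and function name are hypothetical; the code only tests the property and does not perform the restructuring step of the adaptive algorithm.

```python
from collections import deque

class Node:
    """Tree node holding a frequency counter; leaves have no children."""
    def __init__(self, count, left=None, right=None):
        self.count = count
        self.left = left
        self.right = right

def has_sibling_property(root):
    """Check siblings exist and counters are non-increasing in
    breadth-first, right-to-left order."""
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node.count)
        # Enqueue right child first to get a right-to-left traversal.
        children = [c for c in (node.right, node.left) if c is not None]
        if len(children) == 1:
            return False  # a child without a sibling
        queue.extend(children)
    return all(a >= b for a, b in zip(order, order[1:]))

ok = Node(10, left=Node(4), right=Node(6, left=Node(3), right=Node(3)))
bad = Node(13, left=Node(7), right=Node(6, left=Node(3), right=Node(3)))
print(has_sibling_property(ok), has_sibling_property(bad))  # True False
```

In the second tree the left leaf's counter (7) has overtaken its sibling (6), so the traversal yields 13, 6, 7 and the adaptive algorithm would have to swap nodes before continuing.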

Adaptive Huffman Coding

Sources
Web links:
MP3 Converter: http://www.mp3-onverter.com/mp3codec/huffman_coding.htm
Practical Huffman Coding: http://www.compressconsult.com/huffman/
Drozdek Textbook - Chapter 11

Shannon-Fano
In the field of data compression, Shannon–Fano coding, named after Claude Shannon and Robert Fano, is a technique for constructing a prefix code based on a set of symbols and their probabilities (estimated or measured).
It is suboptimal in the sense that, unlike Huffman coding, it does not always achieve the lowest possible expected code word length. However, it is credited with guaranteeing that all code word lengths are within one bit of their theoretical ideal, $-\log_2 P(x)$ for a symbol $x$ with probability $P(x)$. Strictly, this per-symbol bound is a property of Shannon's construction, which assigns each symbol a code word of length $\lceil -\log_2 P(x) \rceil$.

Shannon-Fano Coding
1. For a given list of symbols, develop a corresponding list of probabilities or frequency counts so that each symbol’s relative frequency of occurrence is known.
2. Sort the list of symbols according to frequency, with the most frequently occurring symbols at the left and the least common at the right.
3. Divide the list into two parts, with the total frequency count of the left part being as close to the total of the right part as possible.
4. The left part of the list is assigned the binary digit 0, and the right part is assigned the digit 1. This means that the codes for the symbols in the first part will all start with 0, and the codes in the second part will all start with 1.
5. Recursively apply steps 3 and 4 to each of the two halves, subdividing groups and adding bits to the codes until each symbol has become a corresponding code leaf on the tree.
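
The procedure above translates almost directly into a recursive sketch. The function names and the sample frequencies are assumptions for illustration; when two split points are equally balanced, this version takes the leftmost one.

```python
def shannon_fano(freqs):
    """Return {symbol: code} for a {symbol: frequency} mapping."""
    # Most frequent symbols first (the sort is stable for equal counts).
    symbols = sorted(freqs.items(), key=lambda kv: -kv[1])
    codes = {}

    def split(group, prefix):
        if len(group) == 1:
            codes[group[0][0]] = prefix or "0"  # lone symbol gets one bit
            return
        total = sum(f for _, f in group)
        running, best_i, best_diff = 0, 1, None
        # Find the split where left and right totals are closest.
        for i in range(1, len(group)):
            running += group[i - 1][1]
            diff = abs(total - 2 * running)
            if best_diff is None or diff < best_diff:
                best_i, best_diff = i, diff
        split(group[:best_i], prefix + "0")  # left part: append 0
        split(group[best_i:], prefix + "1")  # right part: append 1

    if symbols:
        split(symbols, "")
    return codes

print(shannon_fano({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}))
# {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}
```

For these counts the code uses 89 bits in total, whereas a Huffman code for the same counts needs 87, which illustrates the suboptimality mentioned earlier.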

Shannon-Fano example

Shannon-Fano References
Shannon, C. E. (July 1948). "A Mathematical Theory of Communication". Bell System Technical Journal 27: 379–423. http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
Fano, R. M. (1949). "The transmission of information". Technical Report No. 65. Cambridge (Mass.), USA: Research Laboratory of Electronics at MIT.