String Processing (Chapter 3)















Words of Wisdom
• The problem is that we sow something different from what we expect to reap. You will reap what you sow.
• "If anyone does a righteous deed, it ensures to the benefit of his own soul; if he does evil, it works against (his own soul). In the end will ye (all) be brought back to your Lord." Al-Quran (45:15)

Introduction
Computers are frequently used for data processing, and one of their primary applications today is in the field of word processing. Such processing involves pattern matching. We discuss pattern matching in detail, including two different pattern matching algorithms and their complexity.

Basic terminology
Each programming language contains a character set that is used to communicate with the computer. The exact set varies from one language to another, but it usually includes the following characters:
Alphabet: a, b, c, d, ..., x, y, z
Digits: 0, 1, 2, 3, ..., 9
Special characters: +, -, /, (, ), $, =

String
A string is a finite sequence S of zero or more characters. The number of characters in a string is called its length. The string with zero characters is called the empty string or the null string. Specific strings are denoted by enclosing their characters in single quotation marks, e.g. ‘The End’, ‘To be or not to be’ and ‘’ are strings of length 7, 18 and 0.

Concatenation
Let S1 and S2 be strings. The string consisting of the characters of S1 followed by the characters of S2 is called the concatenation of S1 and S2. It is denoted by S1//S2, e.g. ‘THE’ // ‘END’ = ‘THEEND’. Note that the length of S1//S2 is equal to the sum of the lengths of S1 and S2.

Substring
A string Y is called a substring of a string S if there exist strings X and Z such that S = X//Y//Z. If X is the empty string, then Y is called an initial substring of S; if Z is the empty string, then Y is called a terminal substring of S. If Y is a substring of S, then the length of Y does not exceed the length of S.

Storing strings
Strings are stored in three types of structure:
I. Fixed-length structure
II. Variable-length structure
III. Linked structure
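The definitions above can be checked with a few lines of Python. This is only an illustrative sketch: the function names (concat, is_substring and so on) are made up for this example, and Python's built-in + operator plays the role of //.

```python
def concat(s1: str, s2: str) -> str:
    """S1 // S2: the characters of s1 followed by the characters of s2."""
    return s1 + s2


def is_substring(y: str, s: str) -> bool:
    """True if s = X // y // Z for some strings X and Z."""
    return y in s


def is_initial_substring(y: str, s: str) -> bool:
    """True if s = y // Z, i.e. X is the empty string."""
    return s.startswith(y)


def is_terminal_substring(y: str, s: str) -> bool:
    """True if s = X // y, i.e. Z is the empty string."""
    return s.endswith(y)


s1, s2 = "THE", "END"
joined = concat(s1, s2)
print(joined)                                   # THEEND
print(len(joined) == len(s1) + len(s2))         # True
print(len("The End"), len("To be or not to be"), len(""))  # 7 18 0
print(is_initial_substring("THE", "THE END"))   # True
print(is_terminal_substring("END", "THE END"))  # True
```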

Fixed-length storage
In this storage each line of print is viewed as a record, where all records have the same length, i.e. each record accommodates the same number of characters.
Advantages:
• Ease of accessing data from any given record.
• Ease of updating data in any given record.
Disadvantages:
• Time is wasted reading an entire record if most of the storage consists of inessential blank space.
• Certain records may require more space than is available.
• When a correction consists of more or fewer characters than the original text, changing a misspelled word requires the entire record to be changed.
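A rough sketch of fixed-length storage follows. The record length of 12 and the names to_records and read_record are assumptions chosen for illustration. It shows both the advantage (record n is found by simple arithmetic) and the disadvantages (blank padding, and a line that does not fit).

```python
RECORD_LENGTH = 12  # assumed record size, chosen only for this example


def to_records(lines, record_length=RECORD_LENGTH):
    """Store each line as one record padded with blanks to the same length."""
    records = []
    for line in lines:
        if len(line) > record_length:
            # Disadvantage: a record may need more space than is available.
            raise ValueError(f"line does not fit in {record_length} characters")
        records.append(line.ljust(record_length))
    return "".join(records)


def read_record(storage, n, record_length=RECORD_LENGTH):
    """Return record n (counting from 1) without its trailing blanks.

    Note: rstrip() also removes blanks that were part of the original line.
    """
    start = (n - 1) * record_length  # direct access: no scanning needed
    return storage[start:start + record_length].rstrip()


storage = to_records(["TO BE", "OR NOT", "TO BE"])
print(len(storage))             # 36 (3 records of 12 characters, mostly blanks)
print(read_record(storage, 2))  # OR NOT
```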

Variable-length storage
The storage of variable-length strings in memory cells with fixed lengths can be done in two general ways:
§ One can use a marker, such as two dollar signs ($$), to signal the end of the string.
§ One can list the length of the string as an additional item in the pointer array.

Linked storage
For most extensive word processing applications, strings are stored by means of linked lists. We discuss word processing operations in detail in the next chapter; here we discuss the way strings appear in these data structures. By a (one-way) linked list we mean a linearly ordered sequence of memory cells called nodes, where each node contains an item called a link, which points to the next node in the list (i.e. it contains the address of the next node).
Example (discussed on the board): bat → cat → sat → vat → NULL
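The storage schemes above can be sketched as follows. All names here (pack_with_marker, pack_with_lengths, Node, build_list, traverse) are hypothetical, and the marker version assumes that no stored string contains a dollar sign. Python's None stands in for NULL.

```python
from dataclasses import dataclass
from typing import Optional

END_MARKER = "$$"


def pack_with_marker(strings):
    """Store strings back to back, each followed by the end marker."""
    return "".join(s + END_MARKER for s in strings)


def unpack_with_marker(storage):
    """Recover the strings (assumes no string contains '$')."""
    return storage.split(END_MARKER)[:-1]  # piece after the last marker is empty


def pack_with_lengths(strings):
    """Store strings back to back and keep (start, length) in a pointer table."""
    table, start = [], 0
    for s in strings:
        table.append((start, len(s)))
        start += len(s)
    return "".join(strings), table


@dataclass
class Node:
    item: str
    link: Optional["Node"] = None  # None plays the role of NULL


def build_list(items):
    """Build a one-way linked list; each node's link points to the next node."""
    head = None
    for item in reversed(items):
        head = Node(item, head)
    return head


def traverse(head):
    """Follow the links from the first node until NULL is reached."""
    items, node = [], head
    while node is not None:
        items.append(node.item)
        node = node.link
    return items


print(pack_with_marker(["TO BE", "OR NOT"]))  # TO BE$$OR NOT$$
print(pack_with_lengths(["TO BE", "OR NOT"])) # ('TO BEOR NOT', [(0, 5), (5, 6)])
print(traverse(build_list(["bat", "cat", "sat", "vat"])))  # ['bat', 'cat', 'sat', 'vat']
```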

Character data type
Here we discuss how various programming languages handle the character data type.

Constants
Many languages denote string constants by placing the string in either single or double quotation marks. (Example discussed on the board.)

Variables
Each programming language has its own rules for forming character variables. These variables fall into three categories:
• A static character variable is one whose length is defined before the program is executed and cannot change throughout the program.
• A semistatic character variable is one whose length may vary during the execution of the program, as long as the length does not exceed a maximum value determined by the program before the program is executed.
• A dynamic character variable is one whose length can change during the execution of the program.
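Python strings behave like dynamic character variables, so the other two categories have to be imitated. The sketch below is one possible illustration only: the names static_assign and SemistaticString are invented, and padding or truncating on assignment is an assumed behaviour for static variables, not something stated in these notes.

```python
def static_assign(value, length):
    """Imitate a static variable: the stored value always has exactly `length`
    characters (assumed behaviour: truncate or pad with blanks)."""
    return value[:length].ljust(length)


class SemistaticString:
    """Imitate a semistatic variable: the length may vary, up to a fixed maximum."""

    def __init__(self, max_length, value=""):
        self.max_length = max_length  # fixed before "execution"
        self.value = ""
        self.assign(value)

    def assign(self, value):
        if len(value) > self.max_length:
            raise ValueError(f"length exceeds maximum of {self.max_length}")
        self.value = value


print(repr(static_assign("CAT", 5)))  # 'CAT  '

name = SemistaticString(5, "AB")
name.assign("ABCDE")                  # allowed: length 5 does not exceed 5
print(name.value)                     # ABCDE

dynamic = "AB"
dynamic = dynamic + "CDEFGH"          # dynamic: the length changes freely
print(len(dynamic))                   # 8
```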

String operations
Although a string may be considered as a sequence or linear array of characters, groups of consecutive elements in a string (such as words or phrases) are called substrings. Furthermore, the basic units of access in a string are usually these substrings, not individual characters.

Substring
Accessing a substring from a given string requires three pieces of information: the name of the string, the position of the first character of the substring in the given string, and either the length of the substring or the position of its last character. We call this operation SUBSTRING, e.g.
SUBSTRING(string, initial, length)

Indexing
Indexing, also called pattern matching, refers to finding the position where a string pattern P first appears in a given string text T. We call this operation INDEX and write
INDEX(text, pattern)
If the pattern P does not appear in the text T, then INDEX is assigned the value 0. (Indexing example discussed on the board.)
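SUBSTRING and INDEX can be sketched in Python as below. Positions count from 1, as in the notes, while Python counts from 0, so the code shifts by one. The function names and the sample strings are this example's own choices.

```python
def substring(string, initial, length):
    """SUBSTRING(string, initial, length) with 1-based positions."""
    if initial < 1 or length < 0 or initial - 1 + length > len(string):
        raise ValueError("substring does not lie inside the string")
    return string[initial - 1:initial - 1 + length]


def index(text, pattern):
    """INDEX(text, pattern): 1-based position of the first occurrence, or 0."""
    # str.find returns -1 when the pattern is absent, so adding 1 gives 0.
    return text.find(pattern) + 1


print(substring("TO BE OR NOT TO BE", 4, 7))         # BE OR N
print(index("HIS FATHER IS THE PROFESSOR", "THE"))   # 7 (inside FATHER)
print(index("HIS FATHER IS THE PROFESSOR", "THEN"))  # 0
```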

Concatenation
Let S1 and S2 be strings. The concatenation of S1 and S2, denoted by S1//S2, is the string consisting of the characters of S1 followed by the characters of S2, e.g. if S1 = ‘MARK’ and S2 = ‘TWIN’, then S1//S2 = ‘MARKTWIN’.

Length
The number of characters in a string is called its length. We write LENGTH(string), e.g. LENGTH(‘COMPUTER’) = 8.

Word Processing
In earlier times computers processed only simple character data; nowadays they also process printed text such as letters and articles. The operations usually associated with word processing are the following:
§ Insertion: inserting a string in the middle of the text.
§ Deletion: removing a string from the text.
§ Replacement: replacing one string in the text by another.

Insertion
Suppose that in a given text T we want to insert a string S so that S begins in position K. We denote this operation by
INSERT(text, position, string)
e.g. INSERT(‘ABCDEF’, 3, ‘XYZ’) = ‘ABXYZCDEF’
The INSERT function can also be implemented by using the string operations:
INSERT(T, K, S) = SUBSTRING(T, 1, K-1) // S // SUBSTRING(T, K, LENGTH(T)-K+1)
That is, the initial substring of T before position K, which has length K-1, is concatenated with the string S, and the result is concatenated with the remaining part of T, which begins in position K and has length LENGTH(T) - (K-1) = LENGTH(T) - K + 1.

Deletion
Suppose that in a given text T we want to remove the substring which begins in position K and has length L. We denote this operation by
DELETE(text, position, length)
e.g. DELETE(‘PRESTON’, 2, 2) = ‘PSTON’
DELETE(‘ABCDEFG’, 2, 4) = ‘AFG’
(Algorithm discussed on the board.)
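A minimal sketch of INSERT and DELETE built only from SUBSTRING and concatenation is given below. The INSERT formula is the one stated above. The DELETE formula is not given in these notes, so the version here, SUBSTRING(T, 1, K-1) // SUBSTRING(T, K+L, LENGTH(T)-K-L+1), is an assumption written by analogy. Function names are illustrative and positions count from 1.

```python
def substring(string, initial, length):
    """SUBSTRING(string, initial, length) with 1-based positions."""
    if initial < 1 or length < 0 or initial - 1 + length > len(string):
        raise ValueError("substring does not lie inside the string")
    return string[initial - 1:initial - 1 + length]


def insert(text, position, string):
    """INSERT(T, K, S) = SUBSTRING(T, 1, K-1) // S // SUBSTRING(T, K, LENGTH(T)-K+1)."""
    before = substring(text, 1, position - 1)
    after = substring(text, position, len(text) - position + 1)
    return before + string + after


def delete(text, position, length):
    """DELETE(T, K, L): keep what precedes position K and what follows the
    deleted part (assumed formula, by analogy with INSERT)."""
    before = substring(text, 1, position - 1)
    after = substring(text, position + length,
                      len(text) - position - length + 1)
    return before + after


print(insert("ABCDEF", 3, "XYZ"))  # ABXYZCDEF
print(delete("ABCDEFG", 2, 4))     # AFG
print(delete("PRESTON", 2, 2))     # PSTON
```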

Replacement
Suppose that in a given text T we want to replace the first occurrence of a pattern P1 by a pattern P2. We denote this operation by
REPLACE(text, pattern1, pattern2)
e.g. REPLACE(‘ABXYEFGH’, ‘XY’, ‘CD’) = ‘ABCDEFGH’
Note that the REPLACE function can be expressed as a deletion followed by an insertion. It can be executed by using the following three steps:
K := INDEX(T, P1)
T := DELETE(T, K, LENGTH(P1))
T := INSERT(T, K, P2)
The first two steps delete P1 from T, and the third step inserts P2 in position K, from which P1 was deleted. (Algorithm discussed on the board.)
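The three steps can be sketched in Python as follows. The helper names are illustrative, positions count from 1, and the case K = 0 (P1 does not occur in T) is handled by returning T unchanged, which is an assumption since the notes do not say what should happen then.

```python
def index(text, pattern):
    """INDEX(text, pattern): 1-based position of the first occurrence, or 0."""
    return text.find(pattern) + 1


def delete(text, position, length):
    """DELETE(text, position, length) with a 1-based position."""
    return text[:position - 1] + text[position - 1 + length:]


def insert(text, position, string):
    """INSERT(text, position, string) with a 1-based position."""
    return text[:position - 1] + string + text[position - 1:]


def replace_first(text, p1, p2):
    """REPLACE(text, p1, p2): replace the first occurrence of p1 by p2."""
    k = index(text, p1)              # step 1: K := INDEX(T, P1)
    if k == 0:
        return text                  # assumed: nothing to replace
    text = delete(text, k, len(p1))  # step 2: T := DELETE(T, K, LENGTH(P1))
    return insert(text, k, p2)       # step 3: T := INSERT(T, K, P2)


print(replace_first("ABXYEFGH", "XY", "CD"))  # ABCDEFGH
print(replace_first("XYXY", "XY", "A"))       # AXY (only the first occurrence)
print(replace_first("ABC", "XY", "CD"))       # ABC
```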

Pattern Matching Algorithm
Pattern matching is the problem of deciding whether or not a given string pattern P appears in a string text T. We assume that the length of P does not exceed the length of T. Here we discuss a pattern matching algorithm, together with its complexity as a measure of its efficiency.

Pattern matching algorithm
In this algorithm we compare a given pattern P with each of the substrings of T, moving from left to right, until we get a match. Let
W_K = SUBSTRING(T, K, LENGTH(P))
That is, W_K denotes the substring of T having the same length as P and beginning with the Kth character of T. First we compare P, character by character, with the first substring, W_1. If all the characters are the same, then P = W_1, so P appears in T and INDEX(T, P) = 1. If some character of P does not match the corresponding character of W_1, then P ≠ W_1 and we move on to the next substring, W_2.

The process stops (a) when we find a match of P with some substring W_K, so that P appears in T and INDEX(T, P) = K, or (b) when we exhaust all the W_K's with no match, and hence P does not appear in T. The maximum value MAX of the subscript K is equal to LENGTH(T) - LENGTH(P) + 1. (Example and algorithm discussed on the board.)
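The procedure described above can be sketched as follows. This is not necessarily the algorithm worked on the board: the function name pattern_match and the comparison counter are additions for illustration. The counter shows the cost: in the worst case the number of character comparisons is about LENGTH(P) times (LENGTH(T) - LENGTH(P) + 1).

```python
def pattern_match(text, pattern):
    """Compare P with W_1, W_2, ..., W_MAX from left to right.

    Returns (index, comparisons): index is the 1-based position of the first
    match, or 0 if P does not appear in T; comparisons counts the character
    comparisons performed.
    """
    n, m = len(text), len(pattern)
    max_k = n - m + 1               # MAX = LENGTH(T) - LENGTH(P) + 1
    comparisons = 0
    for k in range(1, max_k + 1):   # W_k begins at the kth character of T
        matched = True
        for l in range(m):          # compare P with W_k character by character
            comparisons += 1
            if pattern[l] != text[k - 1 + l]:
                matched = False     # P != W_k, move on to W_(k+1)
                break
        if matched:
            return k, comparisons   # case (a): P = W_k
    return 0, comparisons           # case (b): all W_k exhausted


print(pattern_match("ABXYEFGH", "XY"))  # (3, 4)
print(pattern_match("AAAA", "AB"))      # (0, 6)
```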