C Style Strings: A Specialized Form of Array

Copyright © 2007-2016 Curt Hill

Introduction
• The discussion of strings is short on new syntax and long on new functions
• This makes it somewhat easier, I hope

Strings are Different
• Most of the things we have dealt with are machine constructs:
  – int, double, char, for, while, functions
• They map very nicely to things that machines handle very well
• However, the human-machine interface always has to deal with the notion that people read lines of text
• cout can handle this with \n, but cin has problems: to it, blanks, \n and \t are all just 'whitespace', whereas to us blanks and newlines mean very different things

Storage
• We also have the problem of storage of strings
• Strings are inherently variable length
• When we read in a line of text we may get any number of actual characters

What do we do?
• Historically, there are several main approaches to how we handle these in memory or as file records
• Fixed Length Records
• Variable Length Records
  – Delimiter
  – Descriptor

Fixed length records
• Each item is some positive constant in length
  – Originally, the most common was 80 characters, the punch card image
• Items cannot be longer
• Shorter items are padded on the right with some character, usually a blank
• This is the FORTRAN approach

Variable length with delimiter
• There is some special character that says: I am the end of the string or line
• Usually a control character
• This is less general
  – Consider object files, where any character can legally occur
  – Usually there is an escape sequence to get around this

Variable length with descriptor
• The descriptor is an explicit length
• There is an integer stored with the string which says how long it is
• Usually immediately before the first character, and usually one, two or four bytes
• One byte: the string is at most 255 characters
• Two bytes: the string is at most 65,535 characters (about 64 K)
• Four bytes: the string is at most about 4 G
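A minimal sketch of the one-byte descriptor layout, for illustration only. The names PString and make_pstring are hypothetical and do not come from any library; the point is just that the length travels with the characters and no terminator is stored.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical descriptor-style string: a one-byte length, then the characters.
struct PString {
    std::uint8_t length;   // descriptor: 0..255
    char data[255];        // characters only, no '\0' terminator
};

// Stores at most 255 characters of src; longer input is truncated.
PString make_pstring(const char src[]) {
    PString p;
    std::size_t n = std::strlen(src);
    if (n > 255) n = 255;
    p.length = static_cast<std::uint8_t>(n);
    std::memcpy(p.data, src, n);
    return p;
}
```

Finding the length is a single read of p.length, but the one-byte descriptor caps every string at 255 characters.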

Storage
• This storage problem is rather vexing from a machine view
• Variable lengths are difficult to allocate on the stack
• We must know the length to access what follows them
• Thus we must allocate a maximum and waste what we do not use

Examples:
• IBM mainframe systems employ the first (fixed length) and last (descriptor) approaches in their file systems
• Fixed length files: each record is always the same length
• Card files
  – Tape or disk as well
  – This is also possible in C++ with just ordinary arrays of characters
  – Standard Pascal, FORTRAN and COBOL also use this, among others

Examples (continued)
• IBM mainframe systems also employed a variable length record with a length indicator
  – wxStrings do this as well, among others
  – Allocate the maximum number of bytes and then maintain a length indicator
  – File systems do not need to allocate the maximum; they store only the used length

Delimited variable length
• UNIX, DOS and Windows use this for text files
• CR, LF or CR/LF is the line delimiter
  – UNIX and Linux use linefeed
  – Windows and DOS use CR/LF
  – Each file occupies a whole number of allocation units (sectors or blocks), so the end of the file may also be marked with a character or character string
• C/C++ uses this for strings
  – The Null character is the delimiter

Delimiter Again
• Allocate the maximum amount of memory needed for the string
• Use a byte with a binary zero to mark the end
• This is '\0'
• Nothing after the \0 is considered valid contents

Discussion
• All of these approaches are a concession to how people do things
• They are not neat and clean compared to other kinds of things, such as integers
• Mostly because of the variable length
• The delimiter approach resembles the unfull (partially filled) array technique
  – Built in to the string libraries

Strings usage
• We have already seen strings
• A constant string is enclosed in double quotes, whereas an ASCII character constant is one character inside apostrophes
• The string "hi" is how many characters?
  – Two for hi and one for \0 = 3

Null character
• A byte with a value of zero
  – Not the zero digit
• Automatically provided by a double quoted string
• May also be supplied by the escape sequence: '\0'
• Initialization:
    char str[3] = "Hi";
    char c = '\0';
    char d = 0;

Common mistake:
• char x[5] = "Hello";
• We do not have room for the \0
  – Should be a compile error (it is an error in C++; C accepts it and silently drops the \0)
  – Some compilers detect it, others do not
  – Is detected in Dev
• The absence of that \0 can cause runtime errors that will be noted later
• The \0 is always appended to any string in quotes

Declaration
• Declaration of a string is just the same as declaring an array of characters
• Recall that an array of characters can be handled as a string or in any other way consistent with an array of type char
• Either of these works:
    char str[9] = "Hi there";
    char str[] = "Hi there";

Declaration Again
• char str[10] = "Hi there";
  – Declares str as a string of length 10
  – Initializes the first nine characters
  – First eight as above
  – Ninth with \0
  – Tenth is also set to zero, because the remaining elements of a partially initialized array are zero filled
• The only real difference between this and any other array is the shorthand for strings:
  – char str[10] = {'H', 'i', ' ', 't', 'h', 'e', 'r', 'e', '\0'};
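The sizes involved can be checked directly. This is a small sketch with hypothetical function names; it relies on the language rule that elements of a char array not covered by the initializer are set to zero.

```cpp
#include <cstddef>
#include <cstring>

// Physical size of an array sized by its initializer: 8 characters plus '\0'.
std::size_t physical_size() {
    char str[] = "Hi there";
    return sizeof(str);
}

// Logical length: the characters before the '\0'.
std::size_t logical_length() {
    char str[10] = "Hi there";
    return std::strlen(str);
}

// True if both elements after the text are zero.
bool tail_is_zero() {
    char str[10] = "Hi there";
    return str[8] == '\0' && str[9] == '\0';
}
```

Here physical_size() is 9 and logical_length() is 8, the same one-byte difference as in the "hi" example.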

String usage
• Most other differences between a string and any other array are found in the standard functions
• First we will consider the fstreams
• Second, two libraries

cout
• cout (and all ofstreams) may handle a string, as we have seen
• However, since it does not know the length, it must search for the Null character to terminate
• If there is no Null it considers the string longer than it actually is, until it finds a coincidental Null in memory
• The Null is usually common in memory, being the upper three bytes of small positive ints
• Nevertheless, it is easy to get tens, hundreds or thousands of extra bytes displayed

cin
• A different story
• In cin whitespace is still skipped
• So if you read in the string
  – Hello there
  – you will get 6 characters: the Hello plus the Null
• Any leading whitespace is skipped and the string is terminated by the blank between the o and the t
• Solutions:
  – There are three versions of cin.get that will be helpful
• A no parameter version
• A one parameter version
• A two or three parameter version

Get
• These are methods of cin and all ifstreams
• int get()
  – Gets one character and returns it as an int (EOF at end of input)
  – Does not skip whitespace
• istream & get(char &)
  – Gets one character without whitespace skipping
  – Returns the stream itself, which we will mostly ignore but which can be used to indicate success
  – When tested in a condition it behaves like an integer where 0 means unsuccessful

Examples
• Read all the characters:
    char ch[10000];
    int i = 0;
    int c;
    while (i < 10000 && (c = cin.get()) != EOF)   // EOF is declared in <cstdio>
      ch[i++] = c;
• Also:
    char ch[10000];
    int i = 0;
    while (i < 10000 && cin.get(ch[i]))
      i++;   // count a character only after a successful read

String Get
• istream & get(char p[], int n, char delim = '\n')
• The initial argument is a string to read the characters into
• n is the maximum number of characters to obtain
• Since this form always terminates strings with a \0, the maximum number of input characters is only n-1
• Hence cin.get(st, 1) only loads the \0

String Get
• The third parameter is a terminator character
• This can be anything, though the default is an excellent choice
• The get will read characters and store them in p until one of the following conditions is met:
  – Too many characters
  – Delimiter is found
• When we are done, if the delimiter was found it will be the next unread character
  – Hence get never consumes the delimiter

Getline
• istream & getline(char [], int, char = '\n')
• Essentially the same as the three parameter get, except that it eats the delimiter and does not copy it to the buffer
• This is my favorite

Examples
• Declaration:
    char line[MAX];
• This will read the line but leave the end of line in the input buffer:
    cin.get(line, MAX);
• This will read the line and discard the end of line:
    cin.getline(line, MAX);
• A comma delimited file might be read:
    cin.getline(line, MAX, ',');
• There is no good way to read when there are two or more different delimiters
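The difference between get and getline shows up in what is left unread. The sketch below uses a std::istringstream in place of cin so that it can be run without typing input; the function names and the value of MAX are assumptions made for the illustration.

```cpp
#include <sstream>
#include <string>

const int MAX = 80;   // hypothetical buffer size

// Reads one line with get() and reports the next unread character.
int next_after_get(const std::string &text) {
    std::istringstream in(text);
    char line[MAX];
    in.get(line, MAX);        // stops in front of the '\n'
    return in.peek();         // the '\n' is still waiting
}

// Reads one line with getline() and reports the next unread character.
int next_after_getline(const std::string &text) {
    std::istringstream in(text);
    char line[MAX];
    in.getline(line, MAX);    // reads the line and discards the '\n'
    return in.peek();         // first character of the following line
}
```

With the input "one\ntwo", next_after_get sees the newline and next_after_getline sees the t. A second get in a row would stop immediately at that leftover newline, which is one reason getline is often the more convenient choice.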

String assignment
• Given char a[10], b[10];
• Can we:
  – a = b;
• No
• Can we:
  – a = "Hi there";
• No
• How then do we do string assignment?
• Like any array manipulation

The Hard Way
• Usually by function call or something involving a for loop
• Like all arrays the following is possible:
    char a[10], b[10];
    for(int i=0; i<10; i++) a[i] = b[i];
• Or we can define a function to do the same thing:
    void str_asgn(char target[], const char src[], int size);

str_asgn
    void str_asgn(char target[], const char src[], int size){
      for(int i = 0; i<size; i++){
        target[i] = src[i];
        if(target[i] == 0) break;   // stop once the \0 has been copied
      }
    }

Overlapping Arrays
• One of the problems with this function is that overlapping arguments will cause weird results
• For example, given char a[10]:
  – str_asgn(&a[1], a, 9);
  – The size is 9 because the target starts one element in and must not run past the end of a
• However, it uses next to no memory
• What actually happens?

Overlap
• Suppose the following array:
    char a[5] = "hi";
• And we call:
    str_asgn(&a[1], a, 4);
  – The size is 4, not 5, so the copy stays inside a
• Then a[0] is copied to a[1]
  – This is the 'h', which now occupies the first two characters
• Next a[1] is copied to a[2]
  – This is the 'h', which now occupies the first three characters

At beginning of copy (source starts at a[0], target starts at a[1])
    a:  h  i  \0  *  *

First Copy: copy source[0] to target[0]
    a:  h  h  \0  *  *

Second Copy: copy source[1] to target[1]
    a:  h  h  h  *  *

Third Copy: copy source[2] to target[2]
    a:  h  h  h  h  *

You see the pattern. Handy if this is what you want.
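The overlap can be reproduced with the str_asgn function from the earlier slide. This is a sketch: overlap_demo is a hypothetical helper, and the size passed is 4 rather than 5 so that the copy never writes past the end of the five-element array.

```cpp
#include <string>

void str_asgn(char target[], const char src[], int size) {
    for (int i = 0; i < size; i++) {
        target[i] = src[i];
        if (target[i] == 0) break;
    }
}

// Runs the overlapping call and returns all five bytes of the array.
std::string overlap_demo() {
    char a[5] = "hi";            // h i \0 \0 \0
    str_asgn(&a[1], a, 4);       // target starts one element into the source
    return std::string(a, 5);    // take exactly 5 bytes; no '\0' is left to stop at
}
```

Each step copies the 'h' written by the step before it, so the \0 is overwritten before it can ever be copied and the whole array ends up filled with 'h'.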

String operations
• What can we do to an integer (assume int i, j;)?
• Many things
  – Comparison: if(i<j)
  – Arithmetic: i*j-2
  – Assignment: i=j;
• What can we do to two arrays (assume int x[5], y[5])?
  – Next to nothing without resorting to a function
• Should we consider a string an elementary type or a structured type (in this case an array)?

Structured Types
• Clearly C/C++ thinks of strings as arrays, so we can do next to nothing
• We cannot assign one string to another
• It seems like we can do nothing to strings other than write functions that manipulate them or use existing functions that manipulate them
• Fortunately most of the useful functions have already been written

Utility string functions
• The first library to consider is string.h
• Inside this are some utility functions that help us to perform string manipulation
• Some of these we will consider and many others not

strlen
• size_t strlen(const char *source)
• Takes a string as an argument and finds the length of the string
• Not the physical length but the position of the \0 character
• It is the length of the usable string and the subscript of the \0 character
• Extremely handy
• It may overrun the array
  – If the \0 is missing it may give a logical length greater than the physical length
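What strlen does can be sketched in a few lines. The function my_strlen is a hypothetical stand-in written for illustration, not the library's actual implementation.

```cpp
#include <cstddef>

// Counts characters up to, but not including, the '\0'.
std::size_t my_strlen(const char source[]) {
    std::size_t i = 0;
    while (source[i] != '\0')   // keeps walking until a zero byte turns up
        i++;
    return i;                   // also the subscript of the '\0'
}
```

The loop has no idea how large the array is, which is why a missing \0 sends it past the end.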

memcpy
• The two mem functions are not string functions but array functions
• void *memcpy(void *s, const void *ct, size_t n)
• Copies n bytes from ct to s
• Returns a pointer to s
• No overlap is allowed

memmove
• Same as memcpy except that it works if the operands overlap
• Moves (copies really) length bytes from source to dest
• Often folds into one machine language instruction
• Does not care about \0; it is guided only by length

Example
• The mem's can be used for gross array movement of any sort
• For example:
    int a[10], b[10];
    . . .
    memcpy(a, b, 10*sizeof(int));
  – sizeof is an operator that takes an expression or a parenthesized type
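A short sketch of memmove on overlapping operands, where memcpy would not be allowed. The helper name shift_right is hypothetical.

```cpp
#include <cstring>
#include <string>

// Shifts the contents of a small array one position to the right.
std::string shift_right() {
    char a[5] = "hi";             // h i \0 \0 \0
    std::memmove(&a[1], a, 4);    // source and destination overlap
    return std::string(a);        // the array now holds h h i \0 \0
}
```

memmove behaves as if the four source bytes were saved before any were written, so the 'i' and the \0 survive the shift.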

Characteristics
• String functions have a number of characteristics that make them easier to remember
• They all start with str
  – Usually followed by three or four letters
  – This is descriptive
• The first parameter is usually a string and the most important one
  – It is the only one to be changed

strcpy
• char * strcpy(char s[], const char ct[])
• Copy ct to s, including the \0
• The return value is the pointer to s
• No overlap is allowed and there had better be a \0

Two Flavors
• Almost all string functions come in two flavors
  – Brave and bold
  – Cautious
• The brave version always believes that a null character will be found
• The cautious version takes an additional integer which is the maximum length
  – Always has an n in the name right after the str

strncpy
• char * strncpy(char s[], const char ct[], size_t n)
• Copy ct to s, including the \0, or at most n characters, whichever comes first
  – If ct has n or more characters, no \0 is stored in s
• The return value is the pointer to s
• No overlap is allowed
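Because strncpy stores no \0 when the source is too long, a common pattern is to copy one character fewer than the capacity and then terminate by hand. A sketch, with copy_terminated as a hypothetical helper name:

```cpp
#include <cstddef>
#include <cstring>

// Copies src into dest (which holds capacity chars) and always terminates it.
void copy_terminated(char dest[], const char src[], std::size_t capacity) {
    if (capacity == 0) return;              // nowhere to put even the '\0'
    std::strncpy(dest, src, capacity - 1);  // leave room for the terminator
    dest[capacity - 1] = '\0';              // harmless if strncpy already stored one
}
```

Copying "Hello" into a four-character array this way yields "Hel", truncated but still a valid string.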

strcat
• Short for concatenate
• char * strcat(char s[], const char ct[])
• Copy ct to the end of s
  – The \0 of s is replaced and the end of the string is supplied from ct
• The return value is the pointer to s
• No overlap is allowed and there had better be a \0

strncat
• char * strncat(char s[], const char ct[], size_t n);
• Copy ct to the end of s
  – The \0 of s is replaced and the end of the string is supplied from ct
• Copy at most n characters onto s, then append a \0
• The new length is the sum of the length of s and the number of copied characters
• The return value is the pointer to s
• No overlap is allowed
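The n passed to strncat limits the characters appended, not the total size of the result, so the caller has to work out how much room is left. A sketch with a hypothetical helper, append_bounded:

```cpp
#include <cstddef>
#include <cstring>

// Appends as much of src as fits in dest, which holds capacity chars in total.
void append_bounded(char dest[], const char src[], std::size_t capacity) {
    std::size_t used = std::strlen(dest);
    if (used + 1 >= capacity) return;               // no room for anything more
    std::strncat(dest, src, capacity - used - 1);   // strncat adds the '\0' itself
}
```

Appending "there!" to "Hi " in an eight-character array leaves "Hi ther": seven characters plus the \0.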

Recall
• All these functions are straight from the C library
• Standard in every implementation of C/C++ since the 70s
• C had no bool until the 90s, so comparisons return an int
• Also, functions that return a character will actually return an int
  – This will automatically be converted to char

strcmp
• Comparison
• int strcmp(const char s[], const char t[])
• Compare s to t
• Returns
  – <0 if s<t
  – 0 if s==t
  – >0 if s>t
• There had better be a \0 in each string

Comparing characters
• When two integers are compared, the whole integer participates
• String comparison is somewhat different
• We sequentially compare corresponding characters
• The result is the result of comparing the first pair that is different
• A prefix is always less than the longer string
• Character comparison is based on the collating sequence

Example
• Compare two strings: "bbbazz" "bbbbaa"
• The first string is less
• Compare two strings: "zzz" "zzza"
• The shorter is less than the longer
• "Z" < "a" in ASCII
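These comparisons can be tried out with a small wrapper. compare_sign is a hypothetical helper that reduces the result of strcmp to -1, 0 or +1, since only the sign of the return value is guaranteed. The last case assumes an ASCII character set.

```cpp
#include <cstring>

// Collapses strcmp's result to -1, 0 or +1.
int compare_sign(const char s[], const char t[]) {
    int r = std::strcmp(s, t);
    return (r > 0) - (r < 0);
}
```

compare_sign("bbbazz", "bbbbaa"), compare_sign("zzz", "zzza") and compare_sign("Z", "a") all give -1, and two equal strings give 0.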

strncmp
• int strncmp(const char s[], const char t[], size_t n)
• Compare at most the first n characters of s and t
• Returns
  – <0 if s<t
  – 0 if s==t
  – >0 if s>t

strchr
• char * strchr(const char s[], int c)
• Looks for the first c in s
• Returns the pointer to the character if found and NULL otherwise
• There had better be a \0

strrchr
• char * strrchr(const char s[], int c)
  – Nearly the same but starts at the right side
• Looks for the last c in s
• Returns the pointer to the character if found and NULL otherwise
• There had better be a \0
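Both functions return a pointer, and subtracting the start of the string from it gives a subscript. A sketch, with first_index and last_index as hypothetical helper names:

```cpp
#include <cstring>

// Subscript of the first c in s, or -1 if c does not occur.
int first_index(const char s[], char c) {
    const char *p = std::strchr(s, c);
    return p ? static_cast<int>(p - s) : -1;
}

// Subscript of the last c in s, or -1 if c does not occur.
int last_index(const char s[], char c) {
    const char *p = std::strrchr(s, c);
    return p ? static_cast<int>(p - s) : -1;
}
```

For "hello" and 'l' the first index is 2 and the last is 3. The NULL check matters: subtracting from a null pointer would be meaningless.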

Many others
• There are many others here as well that are less important:
  – strspn, strcspn, strpbrk
  – strstr, strerror, strtok
  – memcmp, memchr, memset

Utility character functions
• Another library of importance is ctype.h
• These are functions that do something with a single character
  – Classify it
  – Convert its case

isalpha
• int isalpha(int c);
• Is the character c a letter (upper or lower)?
• Returns 0 for false and a nonzero value (not necessarily 1) for true

isupper and islower
• int isupper(int c);
• Is c an upper case letter?
• int islower(int c);
• Is c a lower case letter?

More
• int isdigit(int c);
  – Is c a digit?
• int isalnum(int c);
  – Is c a letter or digit?
• int iscntrl(int c);
  – Is c a control character?
• int isspace(int c);
  – Is c white space (blank, tab, newline . . .)?

More
• int isprint(int c);
• Is c printable (printables and space)?
• int ispunct(int c);
• Is c a printing character other than space, letters or digits?
• int isxdigit(int c);
• Is c a hexadecimal digit (0-9, A-F, a-f)?
• int isgraph(int c);
• Is c a graphic character (printing except space)?

Conversion
• int tolower(int c);
• Convert c to lower case
• If !(isupper(c)) then c is returned unchanged
• int toupper(int c);
• Convert c to upper case
• If !(islower(c)) then c is returned unchanged

Advantages
• Strings have several privileges over any other array
• Easy constant array notation
  – May be used other than in declarations
• Integrated unfull array scheme

String Objects
• Despite these advantages, string objects are the better approach
• They allow easy assignment and comparison
• Their methods provide all the extra things needed
• C style strings were good for C, but object use is the C++ way
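For contrast, here is a brief sketch using std::string from the standard library as the string object. The slides do not say which string class they have in mind, so this choice is an assumption, and join_words is a hypothetical helper.

```cpp
#include <string>

// Builds "a b" using ordinary assignment and += instead of strcpy and strcat.
std::string join_words(const std::string &a, const std::string &b) {
    std::string result = a;   // assignment copies the characters
    result += " ";            // the object grows as needed
    result += b;
    return result;
}
```

Comparison works with the usual operators as well: std::string("bbbazz") < std::string("bbbbaa") is true, with no call to strcmp and no buffer sizes to track.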

Object Strings Strategy
• Store the string on the heap
• Keep in the object a pointer to the string
• Other info, such as lengths, may also be retained
• Examples: String, AnsiString, cstring, wxString and many others

Once More
• One more complication: lengths
• Single byte strings were sufficient for a while
• They have difficulty with unusual character sets such as Japanese or Chinese
• Another presentation is needed for these