regular expressions - grep
Regular expressions describe sets of strings with patterns (not the same as globbing).
• A normal character matches itself
• . matches any single character
• [<letters>] matches any one of the <letters>, which can also be given as a range (e.g. a-z)
• [^<letters>] matches any one character that is not among the <letters>
• ? after a pattern makes it optional
• + after a pattern matches one or more repetitions
• * after a pattern matches zero or more repetitions
• {<N>} after a pattern matches exactly <N> repetitions
• ^ means the start of the line
• $ means the end of the line
• Parentheses () around a regular expression make it a single unit to which repetition and placement options can be applied
• grep finds lines in files that match limited (basic) regular expressions
  grep '^>' file.txt   displays the lines in file.txt that start with a >
  grep -c '^+$' *fastq   counts, for each FASTQ file, the lines that consist of a single + (in grep's basic syntax the + here is a literal character)
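The two grep commands above can be tried on small test files. This is a minimal sketch; the file contents and the name reads.fastq are made up for illustration, and grep is assumed to behave as standard POSIX/GNU grep does.

```shell
# Sketch: basic grep patterns on two small, made-up files.
workdir=$(mktemp -d)

# A tiny FASTA-like file: header lines start with ">".
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$workdir/file.txt"

# A tiny FASTQ-like record: the third line is a lone "+".
printf '@read1\nACGT\n+\nIIII\n' > "$workdir/reads.fastq"

# Lines that start with ">".
headers=$(grep '^>' "$workdir/file.txt")
echo "$headers"

# Count the lines made up of a single "+".
# In basic (non -E) syntax the "+" is an ordinary character.
plus_count=$(grep -c '^+$' "$workdir/reads.fastq")
echo "$plus_count"

rm -rf "$workdir"
```

With a single file, grep -c prints only the count; with several files it prints one file:count pair per file.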
regular expressions - grep
• grep -E finds lines in files that match a full (extended) regular expression
  grep -E '^[a-zA-Z]' file.txt   displays all lines in file.txt that start with an alphabetic character
  grep -E "a*b+c{4}" *fastq   displays lines in all FASTQ files that contain any number of a's followed by at least one b and then four c's
  grep -E '^(a*b+c{4})+$' file.txt   looks for lines in file.txt that consist entirely of one or more repetitions of that a*b+c{4} pattern
• grep has some useful options
  -c to count the matching lines
  -l to list the names of files that contain a match
  -v to list the lines that don't match
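To see how the unanchored and the anchored pattern differ, and what -c, -l and -v report, the commands can be run on a few test lines. A minimal sketch, assuming a made-up file.txt with four lines:

```shell
# Sketch: grep -E with -c, -l and -v on a small, made-up file.
workdir=$(mktemp -d)
printf 'bcccc\naabbccccbcccc\nabc\n1 bcccc\n' > "$workdir/file.txt"

# -c: count the lines that contain the pattern anywhere.
contains=$(grep -E -c 'a*b+c{4}' "$workdir/file.txt")

# Anchored and grouped: the whole line must be repetitions of the pattern.
whole=$(grep -E -c '^(a*b+c{4})+$' "$workdir/file.txt")

# -v: the lines that do NOT contain the pattern.
others=$(grep -E -v 'a*b+c{4}' "$workdir/file.txt")

# -l: only the names of files with at least one matching line.
files=$(cd "$workdir" && grep -E -l '^[a-zA-Z]' file.txt)

echo "contain: $contains, whole line: $whole"
echo "not matching: $others"
echo "files: $files"

rm -rf "$workdir"
```

The line "1 bcccc" contains the pattern but does not consist only of it, which is why the two counts differ.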
regular expressions - grep
Which of the following lines are recognized by the regular expression ^a*b+c{2} ?
bbbbcc
abbccc
aaabbbcccddd
bccbccbcc
What is the correct regular expression to extract all lines that contain 'University of Miami'?
1. University of Miami
2. Umbilical cord
3. U Miami
4. university of Miami
5. UM
6. Useless Men
7. university in Miami
grep -E '[Uu]*of' UM.txt
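One way to compare candidate answers for the second question is to put the entries in a file and count what each pattern selects. A sketch only: the file below holds the seven entries without their numbers, plus one extra line ('School of Fish') that is not on the slide and is added here to show how a loose pattern behaves; the stricter pattern is one possible alternative, not necessarily the intended answer.

```shell
# Sketch: compare a loose and a stricter pattern on made-up UM.txt contents.
workdir=$(mktemp -d)
printf '%s\n' \
  'University of Miami' \
  'Umbilical cord' \
  'U Miami' \
  'university of Miami' \
  'UM' \
  'Useless Men' \
  'university in Miami' \
  'School of Fish' > "$workdir/UM.txt"   # last line is an added assumption

# [Uu]* may match zero characters, so this selects every line containing "of".
loose=$(grep -E -c '[Uu]*of' "$workdir/UM.txt")

# Spelling out the phrase, with either case for the first letter.
strict=$(grep -E -c '[Uu]niversity of Miami' "$workdir/UM.txt")

echo "loose: $loose, strict: $strict"

rm -rf "$workdir"
```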
regular expressions - grep
What is the correct regular expression to extract all lines that contain a legal Java style integer definition?
int aDog;
int aDog ;
// int aCommentAboutADog;
double aBigDog;
int BadDog;
int dogWithNoTail
int aDog, aCat;
int aSpaceDog, aSpaceCat;
int aDog, aBadCat;
internationalDog;
int a#Dog;
internetName; // fooo
grep -E '^([i ]+)(nt +[aiB][DaSn])' int.txt
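A candidate pattern for this exercise can be checked against sample lines. The sketch below uses one possible, simplified pattern (identifiers restricted to ASCII letters, digits and underscores, which is narrower than what Java actually allows) and a few sample lines modelled on the exercise; the variable names ident and pattern are introduced here for illustration.

```shell
# Sketch: test a simplified "int declaration" pattern on sample lines.
workdir=$(mktemp -d)
printf '%s\n' \
  'int aDog;' \
  'int aDog ;' \
  '// int aCommentAboutADog;' \
  'double aBigDog;' \
  'int dogWithNoTail' \
  'int aDog, aCat;' \
  'internationalDog;' \
  'int a#Dog;' > "$workdir/int.txt"

# Simplified identifier: a letter or underscore, then letters, digits, underscores.
ident='[A-Za-z_][A-Za-z0-9_]*'

# "int", one or more identifiers separated by commas, then a semicolon.
pattern="^ *int +${ident}( *, *${ident})* *;"

matches=$(grep -E "$pattern" "$workdir/int.txt")
match_count=$(grep -E -c "$pattern" "$workdir/int.txt")
echo "$matches"

rm -rf "$workdir"
```

Building the pattern from a named piece keeps the repeated identifier expression readable; the shell expands ${ident} before grep sees the pattern.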
cut, sort, wc
• cut gets columns from a tab-delimited file
  cut -f 1,2 file.txt   extracts the first two columns of file.txt
  cut -f 1-3,5,6 file.txt > tmp.txt   extracts the first three, fifth and sixth columns of file.txt and writes them to tmp.txt
• sort sorts lines from a file
  sort file.txt   sorts the lines of file.txt
• uniq removes repeated adjacent lines, so the input is usually sorted first
  uniq -c file.txt   collapses repeated adjacent lines in file.txt and prefixes each with its count
• wc counts lines, words and characters (bytes)
  wc file.txt   counts lines, words and characters in file.txt
  wc -l file.txt   counts lines in file.txt
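The commands above can be tried on a small tab-delimited file. A minimal sketch with made-up data (four columns, one repeated line):

```shell
# Sketch: cut, sort, uniq -c and wc -l on a small, made-up tab-delimited file.
workdir=$(mktemp -d)
printf 'geneB\tchr2\t200\t+\ngeneA\tchr1\t100\t-\ngeneB\tchr2\t200\t+\n' \
  > "$workdir/file.txt"

# First two columns only.
first_two=$(cut -f 1,2 "$workdir/file.txt")

# uniq only merges adjacent duplicates, so sort before counting.
counts=$(sort "$workdir/file.txt" | uniq -c)

# Reading from standard input makes wc print the number without the file name.
n_lines=$(wc -l < "$workdir/file.txt" | tr -d ' ')

echo "$first_two"
echo "$counts"
echo "$n_lines"

rm -rf "$workdir"
```

Running uniq -c directly on the unsorted file would report the two geneB lines separately, because they are not next to each other.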
paste
• paste concatenates files as columns
  cut -f 1 file.txt > col1.txt
  cut -f 2 file.txt > col2.txt
  cut -f 3 file.txt > col3.txt
  paste col1.txt col2.txt col3.txt   joins the three files side by side, line by line, separated by tabs
  paste -d ',' col1.txt col2.txt col3.txt   joins the files side by side with , as the delimiter
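Splitting a file into columns and pasting them back together should reproduce the original. A minimal sketch, assuming a made-up three-column file.txt:

```shell
# Sketch: split a made-up three-column file with cut, then rebuild it with paste.
workdir=$(mktemp -d)
printf 'a\t1\tx\nb\t2\ty\n' > "$workdir/file.txt"

cut -f 1 "$workdir/file.txt" > "$workdir/col1.txt"
cut -f 2 "$workdir/file.txt" > "$workdir/col2.txt"
cut -f 3 "$workdir/file.txt" > "$workdir/col3.txt"

# Default delimiter is a tab, so this matches the original file.
rebuilt=$(paste "$workdir/col1.txt" "$workdir/col2.txt" "$workdir/col3.txt")
original=$(cat "$workdir/file.txt")

# Same columns, joined with commas instead.
csv=$(paste -d ',' "$workdir/col1.txt" "$workdir/col2.txt" "$workdir/col3.txt")

echo "$rebuilt"
echo "$csv"

rm -rf "$workdir"
```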
pipelines
• A pipeline chains several commands, using the output of each command as the input of the next one.
• Two commands are connected by placing the sign "|" between them.
  cat *fasta | grep -c "^>"   counts the lines that start with > across all FASTA files
  cut -f 1 blast_sample.txt | sort -u | wc -l   counts the distinct values in the first column of blast_sample.txt
  cut -f 1 blast_sample.txt | sort | uniq -c   counts how many times each value occurs in the first column
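The three pipelines can be tried on small inputs. In this sketch the contents of blast_sample.txt and of the two FASTA files are made up; the only assumption about blast_sample.txt is that it is tab-delimited with an identifier in the first column.

```shell
# Sketch: the three pipelines on small, made-up inputs.
workdir=$(mktemp -d)
printf '>s1\nACGT\n>s2\nGGCC\n' > "$workdir/a.fasta"
printf '>s3\nTTAA\n' > "$workdir/b.fasta"
printf 'q1\thitA\nq2\thitB\nq1\thitC\nq3\thitD\nq1\thitE\n' \
  > "$workdir/blast_sample.txt"

# Header lines across all FASTA files.
n_headers=$(cd "$workdir" && cat *fasta | grep -c '^>')

# Number of distinct values in column 1.
n_distinct=$(cut -f 1 "$workdir/blast_sample.txt" | sort -u | wc -l | tr -d ' ')

# Occurrences of each value in column 1.
per_value=$(cut -f 1 "$workdir/blast_sample.txt" | sort | uniq -c)

echo "$n_headers"
echo "$n_distinct"
echo "$per_value"

rm -rf "$workdir"
```

Because cat merges the files before grep sees them, grep -c prints a single total rather than one count per file.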
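Commands inside commands
• `` (backquotes) run a command within a command
  wc -l `grep -l int *`   runs grep first, then counts the lines of every file that grep listed
• But wouldn't that be equivalent to a pipeline?
  grep -l int * | wc -l   takes the output of grep and counts its lines, i.e. the number of matching files, so the two are not equivalent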
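The difference between the two forms shows up as soon as they are run side by side. A minimal sketch with three made-up files, two of which contain "int"; the $(...) form shown last is the common modern spelling of backquotes.

```shell
# Sketch: command substitution versus a pipe, on three made-up files.
workdir=$(mktemp -d)
printf 'int a;\nint b;\nint c;\n' > "$workdir/a.txt"
printf 'int x;\nchar y;\n' > "$workdir/b.txt"
printf 'hello\n' > "$workdir/c.txt"

# Pipe: wc reads grep's output, so it counts the matching file NAMES.
n_files=$(cd "$workdir" && grep -l int * | wc -l | tr -d ' ')

# Backquotes: grep's output becomes wc's ARGUMENTS, so wc opens each file.
line_counts=$(cd "$workdir" && wc -l `grep -l int *`)

# $(...) does the same as backquotes and nests more easily.
same=$(cd "$workdir" && wc -l $(grep -l int *))

echo "$n_files"
echo "$line_counts"

rm -rf "$workdir"
```

The pipe reports how many files matched, while the substituted form reports the line count of each matching file followed by a total.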
UNIX and the Internet
• ping machine checks whether machine is reachable
• talk user@machine lets you chat with user@machine
• ssh user@machine lets you log in remotely to the account user on machine
• scp machine1:file1 machine2:file2 copies file1 on machine1 to file2 on machine2