Introduction to UNIX Manuel Ruiz Bioinformatics School Campinas

Introduction to UNIX
Manuel Ruiz, Bioinformatics School, Campinas, Sao Paulo, Brazil, 21-26 November 2011

UNIX
• UNIX is an operating system (like Windows and Mac OS).
• Multi-tasking: multiple processes can run concurrently.
• Multi-user: different users can read mail, copy files, and print all at once.

UNIX
• Why use UNIX?
- designed for lots of small programs: the shell is a toolbox
- programs can easily be linked together, which allows the development of automatic workflows
- doesn't waste computer resources on graphics
- gives the user much more power
- a lot of free bioinformatics tools are available
• Unfortunately, this comes at the expense of user-friendliness.

Several Unixes
• Two main families: Unix System V and Unix BSD
• Each company has its own Unix: Sun (SunOS or Solaris), HP (HP-UX), IBM (AIX) …
• Since the 1990s: Linux and FreeBSD (=> Mac OS X)
• Cygwin

Linux: a free Unix
• Linus Torvalds (Helsinki)
• GNU project: anyone can use, study, modify, and redistribute the source code
• Widely used worldwide
• Several distributions: Red Hat, CentOS, SUSE, Debian, Ubuntu …

The Linux System
• User commands include executable programs and scripts.
• The shell interprets user commands. It is responsible for finding the commands and starting their execution. Several different shells are available; Bash is a popular one.
• The kernel manages the hardware resources for the rest of the system.

Connection
• Run Xming
• Run PuTTY

General format of UNIX commands
A UNIX command line consists of the name of a UNIX command followed by its "arguments" (options and the target filenames and/or expressions). The general syntax for a UNIX command is:
command -options targets
For example, in ls -l /etc, ls is the command name, -l is an option (flag), and /etc is the argument.

UNIX Commands
• The first word you type on the command line names a program to run. So it is easy to add your own commands: just add, or write, another program.
• The output of the program is returned to the terminal unless you say otherwise. So all your interaction is through one text window.

Unix Files
• Unix is CaSe SeNsiTivE!
• UNIX filenames should contain only letters, numbers, and the _ (underscore), . (dot), and - (dash) characters. NO ACCENTS! NO SPACES! Under any circumstances!
• The extension (e.g. .txt, .fasta) can be any number of letters and is optional. It's for your own convenience, so you know what kind of file is what.
• You cannot have two files with the same name in the same directory.
• Filenames: 255 characters maximum
Special characters have special meanings: & ~ # " ' { ( [ | ` ^ @ ) ] } $ * % ! / ; , ?
For example:
• wildcard characters, the most common of which is *, which tells the shell to substitute any combination of zero or more characters that results in an existing filename.
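The wildcard behaviour can be tried out safely in a scratch directory. The following is a minimal sketch, not part of the original slides; the file names are made up for illustration.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Create four hypothetical example files.
touch seq1.fasta seq2.fasta notes.txt README

# The shell expands *.fasta before ls runs, so ls receives two filenames.
ls *.fasta

# echo shows the expansion directly: every name that starts with "seq".
echo seq*

# Quoting switches the special meaning off: this prints a literal asterisk.
echo "*"

# Keep the expansion in a variable for later inspection.
fasta_files=$(echo *.fasta)
```

Note that it is the shell, not ls, that does the substitution; the program only ever sees the resulting list of names.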

Working with Directories
• Directories organize files on a Unix computer.
• They are equivalent to folders in Windows and Mac, except that they should not have spaces in their names.
• The list of directories that allows you to locate a file is called a PATH (e.g., /home/mruiz/text.txt is the FULL PATH to the file text.txt).
• Understanding directories is vital.

Typical UNIX directory structure
/ = pronounced 'slash' or 'root'.
/bin = where the programs live. Don't mess.
/lib = programming libraries. Ignore.
/etc = admin stuff. Ignore.
/usr = more programs, not user files. Don't mess.
/mnt = 'mount point' for floppies, CD-ROMs etc. If you put a CD-ROM in, it is in /mnt/cdrom
/tmp = temporary files. Ignore.
/var = more temporary files. Ignore.
/home/mruiz = where ALL my files are
/home/fred
/home/jane = where Jane's files are. I can't see them unless she lets me.
A UNIX workstation is usually set up like this; Cygwin and Mac OS X are different.

Your Home Directory: /home/mruiz
• When you log in to any UNIX computer, you start off in your own home directory.
• This is your home. Create sub-directories to store specific projects or groups of information.

Logging In
• Enter login name and password!
• System password file: /etc/passwd (usually).
• You can change your password using the command: passwd.

Some shell commands
• Most important command: man (manual pages).
• Help: UNIX commands, C functions.
• Usage: man <command/function>
• Try "man man"!
• Examples: man ls, man passwd, man printf.

Shortcuts
There are several shortcuts in Unix for specifying directories.
• . (dot) means "the working directory", the one you're in.
• .. (dot dot) means "the parent directory", the directory one level above the working directory. So cd .. will move you up (towards /) one level, and cd ../.. two levels.
• ~ (tilde) means your home directory, so cd ~ will take you home.

Some shell commands
• pwd: what is the working directory?
• ls: list contents of directory
• mkdir <dir-name>: make directory
• rmdir <dir-name>: remove an empty directory
• rm -r <dir-name>: remove a directory with all the contents
• cd <directory>: change directory, ~/ means your home directory
• cp <source> <target>: copy command.
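A short session tying these commands together might look like the sketch below. It runs in a temporary directory that stands in for a home directory; the names project, seq.txt and copy.txt are invented for the example.

```shell
# Start from a temporary directory standing in for your home directory.
base=$(mktemp -d)
cd "$base"

mkdir project            # make a directory
cd project               # move into it
pwd                      # print the working directory
echo "ATGC" > seq.txt    # create a small file to work with
cp seq.txt copy.txt      # copy it
ls                       # list the contents of the directory
listing=$(ls)            # keep the listing for later inspection

cd ..                    # go back up one level
rm -r project            # remove the directory with all its contents
```

Be careful with rm -r: there is no recycle bin, so the removed files are gone for good.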

Some shell commands
• chmod <mode> <filename>: change the mode of a file/directory
• ls -l <directory or filename>: long listing with details
• 9 permission bits (r w x for each category), shown after the file-type character (d for a directory)
• 3 categories: user/group/other
• Permissions: read/write/execute (r/w/x)

File Permissions
Linux provides three kinds of permissions:
• Read: users with read permission may read the file or list the directory
• Write: users with write permission may write to the file or add new files to the directory
• Execute: users with execute permission may execute the file or look up a specific file within a directory

File Permissions
• The long version of a file listing (ls -l) will display the file permissions, followed by the owner and the group.
[Example listing: entries such as hello.c, hello.s and the directory posixuft, owned by user rvdheij and group rvdheij, with permission strings -rwxrwxr-x, -rw-rw-r-- and drwxrwxr-x]

Interpreting File Permissions
-rwxrwxrwx
• 1st character: directory flag (d = directory; l = link)
• characters 2-4: owner permissions
• characters 5-7: group permissions
• characters 8-10: other permissions

Changing File Permissions
• Use the chmod command to change file permissions
• The permissions are encoded as an octal number
chmod 755 file     # Owner=rwx Group=r-x Other=r-x
chmod 500 file2    # Owner=r-x Group=--- Other=---
chmod 644 file3    # Owner=rw- Group=r-- Other=r--
chmod +x file      # Add execute permission to file for all
chmod o-r file     # Remove read permission for others
chmod a+w file     # Add write permission for everyone
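Each octal digit is the sum of read (4), write (2) and execute (1) for one category. The sketch below, which is an illustration added here and uses a made-up file name, applies a few modes and reads the result back from ls -l.

```shell
cd "$(mktemp -d)"
touch script.sh

chmod 644 script.sh                        # rw- r-- r--  (6 = 4+2, 4 = read only)
mode_644=$(ls -l script.sh | cut -c1-10)   # first 10 characters: type + 9 bits

chmod 755 script.sh                        # rwx r-x r-x  (7 = 4+2+1, 5 = 4+1)
mode_755=$(ls -l script.sh | cut -c1-10)

chmod o-r script.sh                        # symbolic form: remove read for others
mode_sym=$(ls -l script.sh | cut -c1-10)

echo "$mode_644 $mode_755 $mode_sym"
```

The symbolic form (o-r, a+w, +x) changes only the bits you name, whereas the octal form sets all nine bits at once.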

Some shell commands
• touch <option> <filename>: create a new file, e.g.: touch directory/filename
• rm <option> <filename>: remove files, e.g.: rm -fr directory/filename
• mv <old> <new>: change the name of a file
• ln -s <src> <dest>: create a symbolic link

File System
• Hierarchical arrangement of files and directories.
• Top level: root or /, e.g.: cd /
• . is the current directory, .. is the directory one level higher, e.g.:
cd . makes no change, because . is the current directory
cd .. changes to the parent directory

File System
• Pathname: absolute and relative.
• Absolute pathname: /home/mruiz/text.txt (begins with /)
• Relative pathname: text.txt, ../mruiz/text.txt

Editors
• Different editors: emacs, nano, nedit, vi
emacs <filename>
nano <filename>
nedit <filename>
vi <filename>

Other ways to view files
These can be very useful. Try them out: more, less, head, tail (each followed by a filename, e.g. text.txt)

head/tail
• head: displays the first lines (10 lines by default)
head Sequences.txt
head -30 Sequences.txt
• tail: displays the last lines (10 lines by default)

cat, less and more
The cat command reads one or more files and prints them to standard output. The operator > can be used to combine multiple files into one. The operator >> can be used to append to an existing file.
The syntax for the cat command is: cat [options] [files]
cat file1 file2 > all
cat file1 >> file2

cat, less and more
• more: the more command displays a file on the screen, one screen at a time. The RETURN key displays the next line of the file. The spacebar displays the next screen of the file. Press q to quit.
The syntax for the more command is: more [options] [files]
• Try more Sequences.txt and compare with cat Sequences.txt (CTRL C to stop)

cat, less and more
• less: a program similar to more, but which allows backward movement in the file as well as forward movement. Also, less does not have to read the entire input file before starting, so with large input files it starts up faster than text editors like vi.

Redirecting standard input and standard output
• command1 > file1 executes command1, placing the output in a new file named file1 (an existing file1 is overwritten).
• command1 >> file1 executes command1, appending the output to the existing file named file1.
• command1 < file1 executes command1, using file1 as the source of input.
• command1 < infile > outfile combines the two capabilities: command1 reads from infile and writes to outfile.
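The four forms of redirection can be seen side by side in a small experiment. This is an illustrative sketch with invented file names (out.txt, sorted.txt), run in a temporary directory.

```shell
cd "$(mktemp -d)"

echo "first line"  > out.txt     # > creates (or overwrites) out.txt
echo "second line" >> out.txt    # >> appends to the existing file

wc -l < out.txt                  # < feeds out.txt to wc as standard input

# Both at once: sort reads out.txt and writes its result to sorted.txt.
sort -r < out.txt > sorted.txt
cat sorted.txt

# Keep two results in variables for later inspection.
line_count=$(wc -l < out.txt)
first_sorted=$(head -n 1 sorted.txt)
```

A common mistake is to use > where >> was intended, which silently replaces the previous contents of the file.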

Connecting commands with Pipes
• The output of one command can become the input of another: "|" is used to separate stages, e.g.: ls -l | more or ls -l | less
ps aux | grep netscape | wc -l
• The output of the ps command is sent to grep.
• grep takes this input and searches for "netscape", passing the matching lines to wc.
• wc takes this input and counts the lines, its output going to the console.
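The same three-stage shape can be tried without depending on what is running on your machine. In this sketch, printf stands in for ps and produces three made-up lines; the job names are purely illustrative.

```shell
# printf plays the role of a command that produces several lines of output.
printf 'blastall job1\nnano notes.txt\nblastall job2\n' | grep blastall | wc -l

# The same pipeline, with the final count captured in a variable.
count=$(printf 'blastall job1\nnano notes.txt\nblastall job2\n' | grep blastall | wc -l)
echo "matching lines: $count"
```

Each stage starts working as soon as data arrives, so no intermediate file is needed between the commands.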

Program & Process
• A program is an executable file that resides on the disk.
• A process is an executing instance of a program.
• A Unix process is identified by a unique non-negative integer called the process ID.
• Check process status using the "ps" command.

Your path
• To see your path, type echo $PATH
• If you are bored with typing the full path to programs, you can put them in your path.
• E.g.:
mkdir ~/bin/
mv program ~/bin/
export PATH=$PATH:~/bin
program
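The effect of extending PATH can be checked with a throwaway program. In this sketch a temporary directory plays the role of ~/bin, and hello_tool is a hypothetical script written just for the demonstration.

```shell
# A temporary directory plays the role of ~/bin.
bindir=$(mktemp -d)

# Write a tiny hypothetical program called "hello_tool" and make it executable.
printf '#!/bin/sh\necho "hello from my own command"\n' > "$bindir/hello_tool"
chmod +x "$bindir/hello_tool"

# Append the directory to PATH for the current shell session.
export PATH="$PATH:$bindir"

# The shell now finds the program by name alone.
hello_tool
tool_output=$(hello_tool)
```

The change lasts only for the current session; to make it permanent, the export line is usually added to a shell start-up file such as ~/.bashrc.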

Background processes
• A program run using the ampersand operator "&" creates a background process.
• E.g.: program &
• CONTROL Z (^Z): suspend process
• bg: switches process to background
• fg: switches process to foreground

How to stop a process?
• Foreground processes can generally be stopped by pressing CONTROL C (^C).
• Background processes can be stopped using the kill command.
• Usage: kill -SIGNAL <process id list>
kill -9 <process id list> (-9 sends a signal that cannot be blocked)
Or kill <process id list>.
• If a foreground process does not stop with ^C, you can open another session and use the kill command.
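Starting and stopping a background process can be rehearsed with sleep, which does nothing for a given number of seconds. This is a minimal sketch for illustration; sleep 300 stands in for any long-running program.

```shell
sleep 300 &            # start a long-running command in the background
pid=$!                 # $! holds the process ID of the last background job
echo "started process $pid"

# Signal 0 sends nothing; it only tests whether the process exists.
kill -0 "$pid" && echo "process is running"

kill "$pid"            # send the default signal (TERM) to ask it to stop

status=0
wait "$pid" || status=$?   # collect the exit status of the stopped job
echo "exit status: $status"
```

Try plain kill first and keep kill -9 as a last resort, since a process stopped with -9 gets no chance to clean up after itself.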

UNIX summary
• Use a text terminal for powerful, remote computing
• Use ls, cd, mv, cp, nano and friends to deal with files and directories
• You can use many tools quickly, but generally the output is in text format

Bioinformatics commands
• No bioinformatics programs come with UNIX
• Most biology department servers have them installed already. But you should probably know how to install them yourself
• It is pretty much the same as installing any other program on UNIX, except you need to keep in mind the requirements for disk space and memory.

Disk space and memory
• These are different things.
• Disk space is the amount of free space for data on your hard disk drive.
• Memory is the amount of RAM installed in the computer.
• Both of these are critical for many bioinformatics applications. For example, BLAST databases can be very large and take up a lot of disk space, and in order to search through them, the BLAST program needs to load a lot of data into RAM.

Running programs
• To run a program with a command, it needs to be either in your PATH, or you must specify the path to it.
• E.g. say I have the blastall binary in my home directory. I could run blastall with either of the following:
cd /home/mruiz
./blastall
• Or, from any directory:
/home/mruiz/blastall
• Or, I can install it and put it in my path:
mv blastall /home/mruiz/bin
export PATH=$PATH:/home/mruiz/bin
blastall

Being "nice"
• blastall takes a lot of resources.
• So that more important jobs take precedence (i.e. other people can still read their terminals), you need to use "nice".
nice -n 10 blastall -p etc.

Big output
• How are we going to deal with the size of the output text?
nice -n 10 blastall -p blastp -i exampleprotein.txt -d Arabidopsis.P > myblastfile.txt
nice -n 10 (blastall blah) | more

The background
• If you are running a program that takes a long time, especially if redirecting output to a file, put it in the background, and you can keep working.

Running overnight
• If you are running a program overnight, use nohup before the program command. This way the command will keep running when you exit the shell.
• Don't forget to use > to redirect output to a file when doing this.

Viewing running processes
• You can see all the processes on the system, ranked by how much memory and CPU time they are using.
top

Getting information from output files
• Often these are huge text files.
• grep is a great tool for getting at the nitty-gritty.
• awk is more powerful, but mostly involves writing scripts, and has been largely superseded by Perl.

grep
• "global regular expression print"
• Allows you to pick out lines of a text file that match a query, count them, and retrieve lines around the match.
• Useful options: -i (ignore case), -c (count matching lines), -v (select non-matching lines), -A (also print lines after each match)

grep - continued
grep 'Query=' myblast.txt           What sequences did I BLAST?
grep -c '>' testprotein.txt         How many sequences are in this file?
grep -A 10 '>' testprotein.txt      Give me the first ten lines of each protein
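These options are easy to try on a tiny FASTA-style file. The sketch below builds a made-up testprotein.txt with three short sequences (the names and residues are invented) and runs each option against it.

```shell
cd "$(mktemp -d)"

# A small made-up FASTA file with three sequences.
cat > testprotein.txt <<'EOF'
>protein1
MKTAYIAKQR
>protein2
MLSRAVCGTS
>Protein3
MVLSPADKTN
EOF

grep -c '>' testprotein.txt             # count header lines = number of sequences
grep -A 1 '>protein2' testprotein.txt   # the header plus the one line after it
grep -i 'protein3' testprotein.txt      # -i ignores case, so ">Protein3" matches
grep -v '>' testprotein.txt             # -v inverts: only the sequence lines

# Keep two counts in variables for later inspection.
n_seqs=$(grep -c '>' testprotein.txt)
n_residue_lines=$(grep -vc '>' testprotein.txt)
```

Always quote the '>' in these commands: unquoted, the shell would treat it as output redirection and overwrite the file.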

egrep
• grep-like command, but it accepts complete regular expressions (including ones with "+", "?", "|", "()")
• -f: obtain PATTERN from FILE

awk / gawk
• Awk: a sequence of actions of the form: pattern { action }, where the pattern determines which records the action is applied to
• Record = a string of characters terminated by a newline, in general a line.
• Field = a string of characters delimited by a space (or by the character specified with the -F option)
• The fields of the current record are accessed through the variables $1, $2, ... $NF; $0 is the whole record, NF is the number of fields in the current record, and $NF is the last field.

awk / gawk: examples
• awk -F ":" '{ $2 = "" ; print $0 }' /etc/passwd
prints each line of the file /etc/passwd after blanking the second field
• awk 'END {print NR}' file
prints the total number of lines in the file
• awk '{print $NF}' file
prints the last field of each line
• who | awk '{print $1, $5}'
prints the login and the connection time
• awk 'length($0)>75 {print}' file
prints the lines longer than 75 characters (print is equivalent to print $0)
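The pattern { action } form and the field variables can be explored on a small file without touching the real /etc/passwd. In this sketch, users.txt is a made-up, colon-separated file in a similar style.

```shell
cd "$(mktemp -d)"

# A made-up, colon-separated file loosely modelled on /etc/passwd.
cat > users.txt <<'EOF'
mruiz:x:1001:/home/mruiz
fred:x:1002:/home/fred
jane:x:1003:/home/jane
EOF

awk -F ":" '{ print $1 }' users.txt             # first field of each record
awk -F ":" '{ print $NF }' users.txt            # last field of each record
awk 'END { print NR }' users.txt                # total number of records
awk -F ":" '$3 > 1001 { print $1 }' users.txt   # the pattern selects records

# Keep some results in variables (newlines turned into spaces by tr).
last_fields=$(awk -F ":" '{ print $NF }' users.txt | tr '\n' ' ')
n_records=$(awk 'END { print NR }' users.txt)
selected=$(awk -F ":" '$3 > 1001 { print $1 }' users.txt | tr '\n' ' ')
```

When the pattern is omitted, the action runs on every record; when the action is omitted, matching records are simply printed.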

Argument list too long
• When using grep or other commands that require listing or searching through several thousand files, you may get the "Argument list too long" or "/bin/grep: Argument list too long." error.
• Workaround: xargs
• find ~ -type f -print0 | xargs -0 grep "examplestring"
finds all files in your home directory; each file that is found is then searched using grep for the text "examplestring".
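The find and xargs combination can be tested on a handful of files before pointing it at a whole home directory. The sketch below uses a temporary directory and invented file names; it assumes a find and xargs that support -print0 and -0, as the GNU and BSD versions do.

```shell
searchdir=$(mktemp -d)

# Three made-up files; the second has a space in its name on purpose,
# to show a case that the NUL-separated form handles safely.
echo "contains examplestring here" > "$searchdir/a.txt"
echo "examplestring again"         > "$searchdir/b file.txt"
echo "nothing to see"              > "$searchdir/c.txt"

# find prints names separated by NUL bytes (-print0); xargs -0 reads them
# and runs grep in batches. grep -l prints only the names of matching files.
find "$searchdir" -type f -print0 | xargs -0 grep -l "examplestring"

# Count how many files matched.
n_matches=$(find "$searchdir" -type f -print0 | xargs -0 grep -l "examplestring" | wc -l)
```

Because xargs splits the file list into batches that fit the system limit, the "Argument list too long" error does not occur however many files find returns.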

Remote connection
• telnet
• ftp: file transfers
• ssh/scp/sftp: secure connections
• wget

ftp
Getting files from remote servers
ftp.ncbi.nih.gov

ftp commands
open = open a connection
ls, cd = same as UNIX
get = get me this file
mget = get more than one file
put = put a file on the server
lcd = local cd
close = close connection
bye = exit the ftp program

Secure ftp
• Although NCBI allows you to connect using ftp, this is because they have only public files, and they don't let you upload anything.
• Most UNIX computers disallow ftp logins. However, if you can ssh to a computer, you can also use sftp. The commands are identical to ftp, but you can access your own files securely.

scp
• Copying files from/to remote servers
• scp src:/src_path dest:/dest_path

wget
• But what if you want to get a file which is available for download from a website, but not by ftp?
• wget will get the contents of any URL and put them in a file.
wget www.upm.edu.my

From your desktop: FileZilla, WinSCP

Compression tools
• gzip / gunzip: .gz files
• compress / uncompress: .Z files
• tar -cvf / tar -xvf: .tar files
• tar -cvzf / tar -xvzf: .tgz files
• Others: bzip, bzip2, zip
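A full round trip, packing a directory into a .tgz file and unpacking it again, might look like the sketch below. The directory and file names are invented, and -v (verbose) is left out to keep the output short.

```shell
cd "$(mktemp -d)"

# A small made-up directory of results to archive.
mkdir results
echo "hit1" > results/blast1.txt
echo "hit2" > results/blast2.txt

tar -czf results.tgz results     # c = create, z = gzip, f = archive file name
tar -tzf results.tgz             # t = list the archive contents

rm -r results                    # remove the original directory
tar -xzf results.tgz             # x = extract; the directory comes back

# Read one restored file back for later inspection.
restored=$(cat results/blast1.txt)
```

Listing an archive with -t before extracting it is a useful habit, because it shows where the files will land.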

Downloading programs
• "Ready to run" programs are called binaries in unix-speak.
• They are often "zipped" in a .tar.gz file.
• To unzip, use gunzip and tar -xvf
• To run, specify the path to the program. E.g., ./program or /home/matt/bin/program
• You can download programs for UNIX just as you would for a PC

Source Code
• Most bioinformatics software is free and open source. That is, you can download the actual instructions the programmer wrote.
• This is great, because it means you can install these programs on almost any machine.

The root user
• Most UNIX machines have an account called "root"
• root can see everything, change everything, delete everything, including other users' work
• Unless you buy your own machine, nobody sane will give you root access
• You usually need root access to install programs in the default location. But you can put them in your home directory instead.

UNIX summary
• Use ls, cd, mv, cp, nedit and friends to deal with files and directories
• Install, or compile, any program you like. Most are free.
• Use blastall, etc. on the command line for high-throughput work. Transfer the output to a file for best results and run in the background. Grep the output file to get pertinent information…

Other useful commands
diff: attempts to determine the minimal set of changes needed to convert the file specified by the first argument into the file specified by the second argument
diff file1 file2

Other useful commands
find: searches a given file hierarchy specified by path, finding files that match the criteria given by expression
find / -name "ls"
find . -name "seq_a_moi.fasta"

Other useful commands to explore
• sort: sort files
• wc: count characters, words, lines
• split/csplit: cut a file into pieces (horizontally)
• cut: cut out fields (vertically)
• paste: merge corresponding or subsequent lines of files
• sed: perform basic text transformations on an input stream