Linux for Bioinformatics
Navigating in Linux file system
After you log in to the system, you are placed in your home directory at the shell command prompt. This section shows you how to explore the Linux file system using shell commands.
Path
To understand the Linux file system, you can imagine it as a tree structure.
In Linux, a path is the unique location of a file or a directory in the file system.
On a standard Linux system, the layout generally follows the scheme presented below.
The tree of the file system starts at the trunk, indicated by a forward slash (/). This directory, which contains all underlying directories and files, is also called the root directory or “the root” of the file system.
%%bash
## In your account, you will see a folder
## with your account ID as the name
cd ~
echo $HOME
/home/xie186
Relative and absolute path
- Absolute path
An absolute path is the location of a file or directory starting from the root directory (/), that is, from the root of the tree.
Here are some examples:
/home/xie186
/home/xie186/.bashrc
- Relative path
A relative path is a path relative to the present working directory, for example data/sample1/ and ../doc/.
If you want to get the absolute path from a relative path, you can use readlink with the -f option:
pwd
readlink -f ../
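As a minimal sketch of the difference between the two kinds of path, the commands below build a small scratch tree and resolve a relative path into an absolute one with readlink -f. The directory names project, data/sample1 and doc are made up for illustration.

```shell
# Create a scratch directory tree (all names here are illustrative)
workdir="$(cd "$(mktemp -d)" && pwd -P)"
mkdir -p "$workdir/project/data/sample1" "$workdir/project/doc"

# Move into a subdirectory; relative paths are now resolved from here
cd "$workdir/project/data"
pwd

# ../doc is relative to the present working directory;
# readlink -f prints the matching absolute path
abs_doc="$(readlink -f ../doc)"
echo "$abs_doc"
```

The same relative path ../doc would point somewhere else if you ran the command from a different working directory, which is why scripts often convert relative paths to absolute ones first.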
Once we are in a Linux file system, we need to know 1) where we are; 2) how to get where we want to go; and 3) what files or directories exist in a particular path.
Check where you are using command pwd
To know where we are, we use the pwd command. The name pwd is short for “print name of current/working directory”. It returns the full path of the current directory.
The pwd command is almost always used by itself: you only need to type pwd and press ENTER.
%%bash
pwd
Listing the contents using command ls
After you know where you are, you will want to know what is in that directory. Use the ls command to list directory contents.
Its syntax is:
ls [option]... [file]...
With no options, ls lists files and directories in bare format, meaning that detailed information (type, size, modification date and time, permissions, links, etc.) is not shown. When you use ls by itself, it lists the files and directories in the current directory.
ls ~/
ls -a
ls -ld
Linux command options can be combined after a single - (dash), with no space between them.
The following command combines the l and t options in this way; it is a shorter way to write ls -l -t ~/.bashrc and gives the same output.
ls -lt ~/.bashrc
-rw-r--r--. 1 xie186 zt-bioi611 1067 Aug 22 22:27 /home/xie186/.bashrc
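To check that combined and separate options behave the same, the sketch below lists a scratch directory both ways and compares the results. The file names are illustrative only.

```shell
# Scratch directory with one hidden and one visible file (illustrative names)
demo_dir="$(mktemp -d)"
touch "$demo_dir/.hidden" "$demo_dir/visible.txt"

# Options given separately ...
separate="$(ls -l -a "$demo_dir")"
# ... and combined behind a single dash
combined="$(ls -la "$demo_dir")"

# The two listings are identical
[ "$separate" = "$combined" ] && echo "Both forms give the same listing"
```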
Change directory using command cd
Unlike pwd, cd usually needs a path (either absolute or relative) to the directory you want to enter.
If you do not provide a path, cd changes to your home directory by default.
Path | Shortcuts | Description |
---|---|---|
Single dot | . | The current folder |
Double dots | .. | The folder above the current folder |
Tilde character | ~ | Home directory (normally /home/my_login_name) |
Dash | - | Your last working directory |
Here are some examples:
cd ~
pwd
ls
ls ../
##
pwd
cd ../
pwd
cd ./
pwd
Each directory contains two special entries: . (a link to itself) and .. (a link to its parent directory). The exception is the root directory, where .. also refers to the root directory.
Sometimes you go to a new directory and do something, then remember that you need to go back to the previous working directory. To get back instantly, use a dash.
%%bash
# This is our current directory
pwd
# Let us go to our home directory
cd ~
# Check where we are
pwd
# Let us go to your previous working directory
cd -
# Check where we are now
pwd
/home/xie186/BIOI611_lab/docs
/home/xie186
/home/xie186/BIOI611_lab/docs
/home/xie186/BIOI611_lab/docs
Manipulations of files and directories
In Linux, manipulating files and directories is one of the most frequent tasks. In this section, you will learn how to copy, rename, remove, and create files and directories.
Command line cp
In Linux, the cp command copies files and directories into a target directory.
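A minimal sketch of common cp usage follows. It works in a scratch directory, and the file and folder names (seq.txt, raw, backup) are placeholders chosen for illustration.

```shell
# Work in a scratch directory so nothing real is touched
work="$(mktemp -d)"
cd "$work"
echo "ACGT" > seq.txt
mkdir -p raw backup

# Copy one file into a directory
cp seq.txt raw/
# Copy a file under a new name
cp seq.txt seq_copy.txt
# Copy a whole directory; -r (recursive) is required for directories
cp -r raw backup/

# Show the resulting tree
ls -R "$work"
```

Unlike mv, cp leaves the original file in place, so seq.txt still exists after all three commands.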
Command line mv
Move or rename files and folders using mv:
# move a file from one location to another
mv file1 target_directory/
# rename a file
mv file1 file2
# move several files into a directory
mv file1 file2 file3 target_directory/
Command mkdir
The syntax is shown as below:
mkdir [OPTION ...] DIRECTORY ...
Multiple directories can be specified when calling mkdir
mkdir directory1 directory2
mkdir -p foo/bar/baz
You can define a complex directory tree with one command:
mkdir -p project/{software,results,doc/{html,info,pdf},scripts}
Then you can view the directory tree using the tree command.
Command rm
You can use rm to remove both files and directories.
## You can remove one file.
rm file1
## `rm` can remove multiple files simultaneously
rm file2 file3
You can also use rm to remove a folder. Use the -r (recursive) option, which removes the folder together with everything inside it:
rm -r FOLDER
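If a folder contains write-protected files, rm asks for confirmation before deleting them. Adding -f suppresses these prompts. Deleted files cannot be recovered, so use rm -rf with care.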
mkdir test_folder
rm -r test_folder
View text files in Linux
Commands cat, more and less
The name cat is short for concatenate: the command concatenates files and prints them on the standard output.
The syntax is shown as below:
cat [OPTION]... [FILE]...
For a small text file, cat can be used to view the file on the standard output.
The more command is an older utility. When the text passed to it is too large to fit on one screen, it pages the text. You can scroll down but not up.
The syntax of more is shown below:
more [options] file [...]
The command less was written by a man who was fed up with more’s inability to scroll backwards through a file. He turned less into an open source project and over time, various individuals added new features to it. less is massive now. That’s why some small embedded systems have more but not less. For comparison, less’s source is over 27000 lines long. more implementations are generally only a little over 2000 lines long.
The syntax of less is shown below:
less [options] file [...]
Command head and tail
The head command outputs the first part of files. By default, it outputs the first 10 lines of each file.
head [OPTION]... [FILE]...
Here is an example of printing the first 5 lines of a file:
head -n 5 code_perl/variable_assign.pl
In fact, the letter n does not need to be used at all. The hyphen and the integer (with no intervening space) are sufficient to tell head how many lines to return. Thus, the following command prints the first 5 lines of target_file.txt, just as head -n 5 target_file.txt would:
head -5 target_file.txt
The tail command outputs the last part of files. By default, it prints the last 10 lines of each file to standard output.
The syntax is shown below:
tail [OPTION]... [FILE]...
Here is an example of printing the last 5 lines of a file:
tail -5 target_file.txt
To view a file starting from a specific line, use -n +NUMBER with the tail command. For example, the following command prints the file starting from the 2nd line:
tail -n +2 target_file.txt
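The two commands can also be chained to pull out a range of lines from the middle of a file. The sketch below uses a scratch file holding the numbers 1 to 20 (generated with seq purely for illustration) so that the selected lines are easy to recognize.

```shell
# Scratch file with 20 numbered lines
tmp_file="$(mktemp)"
seq 1 20 > "$tmp_file"

# First 3 and last 2 lines
head -n 3 "$tmp_file"
tail -n 2 "$tmp_file"

# Lines 5 to 8: keep the first 8 lines, then the last 4 of those
middle="$(head -n 8 "$tmp_file" | tail -n 4)"
echo "$middle"

# Skip a one-line header: print everything from line 2 onward
no_header_count="$(tail -n +2 "$tmp_file" | wc -l)"
echo "$no_header_count"
```

The tail -n +2 form is handy for tables whose first line is a header that should not be processed as data.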
Auto-completion
In most shell environments, the programmable completion feature speeds up typing. You can type part of a command, file, or directory name and then press the TAB key to complete it automatically. If there is more than one possible completion, pressing TAB again lists all of them.
For example, type so, press the Tab key twice, and you get a list of the commands that start with these letters:
soelim sort sotruss soundstretch source
Demonstration of programmable completion feature.
File permissions
In Linux, file permissions are a vital aspect of system security and resource management. This is particularly important in bioinformatics, where large datasets and scripts are often shared across teams. Permissions determine who can read, write, or execute a file, ensuring that critical data is not accidentally modified or deleted.
Three Permission Categories:
- User (u): The owner of the file.
- Group (g): A group of users who share access to the file.
- Other (o): All other users on the system.
Permission Types:
- Read (r): Ability to view the contents of a file.
- Write (w): Ability to modify or delete the file.
- Execute (x): Ability to run the file as a program (for scripts or executables).
%%bash
groups $USER animako eunal gstewar1 mjames17 mjeakle nmilza rahooper
xie186 : zt-bioi611 zt-bioi611_mgr
animako : zt-bioi611
eunal : zt-bioi611
gstewar1 : zt-bioi611
mjames17 : zt-bioi611
mjeakle : zt-bioi611
nmilza : zt-bioi611
rahooper : zt-bioi611
%%bash
mkdir -p ~/test_permission/
touch ~/test_permission/test.txt
ls -l ~/test_permission/
rm -rf ~/test_permission/
total 0
-rw-r--r--. 1 xie186 zt-bioi611 0 Sep 8 22:52 test.txt
Here, the first character represents the type of file (e.g., - for a regular file or d for a directory), followed by three groups of three characters, each representing the permissions for the user, group, and others, respectively.
Examples:
- -rwxr-xr--: The owner has read, write, and execute permissions. The group has read and execute permissions, while others can only read the file.
- drwxr-x---: A directory where the owner can read, write, and access (execute). The group can only read and access, while others have no permissions.
Modify file permissions using the chmod command. Permissions can be set in two ways:
Symbolic Mode:
In symbolic mode, you modify permissions by referencing the categories (user, group, other) and specifying whether you're adding (+), removing (-), or setting (=) permissions.
# Add execute permission for the user:
chmod u+x filename
# Remove write permission for the group:
chmod g-w filename
# Set read-only permission for others:
chmod o=r filename
Symbolic mode is intuitive and flexible, especially when you want to make precise adjustments to permissions without affecting other categories. This is useful for common file-sharing tasks in bioinformatics where you need to tweak access for specific collaborators.
Numeric Mode (Octal representation):
In numeric mode, file permissions are set using a three-digit number. Each digit represents the permissions for user, group, and other, respectively. Each digit is calculated by adding the values of the read, write, and execute permissions:
- Read (r) = 4
- Write (w) = 2
- Execute (x) = 1
Example Permission Breakdown:
Read (r), Write (w), and Execute (x) for user = 7
Read (r) and Execute (x) for group = 5
Read (r) only for others = 4
chmod 754 filename
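To see how the numeric and symbolic forms relate, the sketch below sets a mode on a scratch file and reads it back with stat. It assumes GNU stat (the -c option), which is what most Linux systems provide.

```shell
# Scratch file to experiment on
perm_file="$(mktemp)"

# 7 = 4+2+1 (rwx), 5 = 4+1 (r-x), 4 = r--
chmod 754 "$perm_file"
mode_numeric="$(stat -c '%a' "$perm_file")"    # octal form
mode_symbolic="$(stat -c '%A' "$perm_file")"   # rwx form, as shown by ls -l
echo "$mode_numeric $mode_symbolic"

# The symbolic equivalent of chmod 640: rw for user, r for group, nothing for others
chmod u=rw,g=r,o= "$perm_file"
mode_after="$(stat -c '%a' "$perm_file")"
echo "$mode_after"
```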
An example to help you understand the execute permission:
%%bash
printf '#!/user/bin/python\nprint("Hello, Welcome to Course BIOI611!")' > ~/test.py
%%bash
ls -l ~/test.py
python ~/test.py
-rw-r--r--. 1 xie186 zt-bioi611 61 Sep 8 23:06 /home/xie186/test.py
Hello, Welcome to Course BIOI611!
The error message below is shown if you try to run ~/test.py directly as a program:
bash: line 1: /home/xie186/test.py: No such file or directory
%%bash
chmod u+x ~/test.py
ls -l ~/test.py
python ~/test.py
rm ~/test.py
-rwxr--r--. 1 xie186 zt-bioi611 61 Sep 8 23:06 /home/xie186/test.py
Hello, Welcome to Course BIOI611!
Disk Usage of Files and Directories
The du command (short for Disk Usage) is a standard Unix/Linux command used to check the disk usage of files and directories on a machine. It has many options that control the output format. The du command also reports file and directory sizes recursively.
%%bash
du -h ~/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref
2.5G /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref
%%bash
du -ah ~/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref
2.9M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/sjdbList.fromGTF.out.tab
7.5K /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/Log.out
936M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/SA
1.5G /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/SAindex
3.0M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/transcriptInfo.tab
2.3M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/sjdbList.out.tab
1.5M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/geneInfo.tab
1.0K /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/genomeParameters.txt
512 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/chrLength.txt
512 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/chrNameLength.txt
512 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/chrStart.txt
7.6M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/exonGeTrInfo.tab
3.1M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/exonInfo.tab
2.8M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/sjdbInfo.txt
512 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/chrName.txt
119M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/Genome
2.5G /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref
%%bash
du -csh /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/*
19G /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/raw_data
0 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/raw_data_smart_seq
1.5K /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/s1_download_data.sub
575K /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/s1_download_smart_seq-7478223-xie186.err
0 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/s1_download_smart_seq-7478223-xie186.out
8.5K /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/s1_download_smart_seq.sub
2.5K /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/s2_star.sub
34G /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_align
2.5G /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref
512 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/test.sub
512 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/test.txt
55G total
Symbolic link
A symbolic link, similar to a shortcut, points to another file or folder.
ln -s <path_to_files/folder_to_be_linked> <symlink_to_be_created>
ls -l <symlink>
unlink <symlink>
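A concrete run of these three commands might look like the sketch below. The names reference/genome.fa and genome_link.fa are placeholders, not files from the course data.

```shell
# Scratch directory with a small stand-in file to link to
link_dir="$(cd "$(mktemp -d)" && pwd -P)"
cd "$link_dir"
mkdir reference
echo ">chrI" > reference/genome.fa

# Create the symbolic link: target first, link name second
ln -s reference/genome.fa genome_link.fa

# ls -l shows where the link points (link -> target)
ls -l genome_link.fa
link_target="$(readlink genome_link.fa)"

# Reading the link reads the target file
link_content="$(cat genome_link.fa)"
echo "$link_content"

# Removing the link leaves the target untouched
unlink genome_link.fa
ls reference/
```

Symbolic links are useful for making large reference files visible in a project folder without copying them.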
File Management and Data Handling
Compressing and decompressing files (gzip, gunzip, tar).
Compress one file:
%%bash
perl -e 'for($i=0; $i<10000; ++$i){ print "test\n";}' > test.txt
du -h test.txt
gzip test.txt
du -h test.txt.gz
gunzip test.txt
ls test.txt
rm test.txt
52K test.txt
4.0K test.txt.gz
test.txt
Compress multiple files:
%%bash
perl -e 'for($i=0; $i<10000; ++$i){ print "test\n";}' > test1.txt
perl -e 'for($i=0; $i<10000; ++$i){ print "test\n";}' > test2.txt
du -h test1.txt test2.txt
tar zcvf test.tar.gz test1.txt test2.txt
du -sh test.tar.gz
ls test1.txt test2.txt
52K test1.txt
52K test2.txt
test1.txt
test2.txt
4.0K test.tar.gz
test1.txt
test2.txt
- z: Tells tar to compress the archive using gzip. By convention, the archive name ends in .gz to indicate that it has been compressed with the gzip utility.
- c: Stands for create. It instructs tar to create a new archive.
- v: Stands for verbose. When used, tar displays information about the files being added to the archive, such as their names.
- f: Stands for file. It tells tar that the next argument (test.tar.gz) is the name of the archive file to create.
%%bash
tar tvf test.tar.gz
rm test.tar.gz test1.txt test2.txt
-rw-r--r-- xie186/zt-bioi611 50000 2024-08-25 21:52 test1.txt
-rw-r--r-- xie186/zt-bioi611 50000 2024-08-25 21:52 test2.txt
- t: Lists the contents of the archive.
- v: Displays additional details about each file (such as file permissions, size, and modification date).
- f: Specifies that the next argument (test.tar.gz) is the archive file to operate on.
To uncompress a tar.gz file, use tar zxvf:
tar zxvf test.tar.gz
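By default tar extracts into the current directory. As a sketch, the -C option can be used to extract into a different, already existing directory. The directory name extracted is an arbitrary choice for this example.

```shell
# Build a small archive in a scratch directory
tar_dir="$(mktemp -d)"
cd "$tar_dir"
printf 'test\n' > test1.txt
printf 'test\n' > test2.txt
tar zcf test.tar.gz test1.txt test2.txt

# Extract into a separate directory; it must exist before tar runs
mkdir extracted
tar zxf test.tar.gz -C extracted

# Both files are restored inside extracted/
n_extracted="$(ls extracted | wc -l)"
ls extracted
```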
Transferring files within the network
Basic Syntax of scp:
scp [options] source destination
Copy a Local File to a Remote Server
scp file.txt username@remote_host:/path/to/destination/
An alternative command is rsync.
File searching, filtering, and text processing
Command find
The find command is designed for comprehensive file and directory searches.
find [path] [options] [expression]
%%bash
find /home/xie186/scratch/bioi611/bulk_RNAseq -name "*.fastq.gz"
/home/xie186/scratch/bioi611/bulk_RNAseq/raw_data/N2_day7_rep3.fastq.gz
/home/xie186/scratch/bioi611/bulk_RNAseq/raw_data/N2_day1_rep3.fastq.gz
/home/xie186/scratch/bioi611/bulk_RNAseq/raw_data/N2_day1_rep1.fastq.gz
/home/xie186/scratch/bioi611/bulk_RNAseq/raw_data/N2_day7_rep1.fastq.gz
/home/xie186/scratch/bioi611/bulk_RNAseq/raw_data/N2_day1_rep2.fastq.gz
/home/xie186/scratch/bioi611/bulk_RNAseq/raw_data/N2_day7_rep2.fastq.gz
Text data counts wc
%%bash
find /home/xie186/scratch/bioi611/bulk_RNAseq -name "*.fastq.gz" |wc -l
6
Pipe |
In Linux and Unix-based systems, the pipe (|) is used on the command line to pass the output of one command as the input to another command. This allows you to chain commands together and perform more complex tasks in a single line.
%%bash
grep '>' ~/scratch/bioi611/reference/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa |wc -l
7
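The same pattern can be tried without the course reference genome. The sketch below writes a three-sequence toy FASTA file (made-up content) and counts its header lines, first with a pipe and then with the grep -c option, which counts matching lines directly.

```shell
# Toy FASTA file with three sequences (illustrative content only)
fasta="$(mktemp)"
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > "$fasta"

# grep prints the header lines; wc -l counts them
n_seqs="$(grep '>' "$fasta" | wc -l)"
echo "$n_seqs"

# grep -c counts matching lines directly, without a pipe
n_seqs_c="$(grep -c '>' "$fasta")"
echo "$n_seqs_c"
```

Always quote the > character in such commands; unquoted, the shell treats it as output redirection and would overwrite the file named after it.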
Column filtering
The cut command can be used to print selected parts of lines from each FILE to standard output.
%%bash
wget -O GSE164073_raw_counts_GRCh38.p13_NCBI.tsv.gz "https://ncbi.nlm.nih.gov/geo/download/?type=rnaseq_counts&acc=GSE102537&format=file&file=GSE102537_raw_counts_GRCh38.p13_NCBI.tsv.gz"
--2024-08-25 21:08:03-- https://ncbi.nlm.nih.gov/geo/download/?type=rnaseq_counts&acc=GSE102537&format=file&file=GSE102537_raw_counts_GRCh38.p13_NCBI.tsv.gz
Resolving ncbi.nlm.nih.gov (ncbi.nlm.nih.gov)... 2607:f220:41e:4290::110, 130.14.29.110
Connecting to ncbi.nlm.nih.gov (ncbi.nlm.nih.gov)|2607:f220:41e:4290::110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 349584 (341K) [application/octet-stream]
Saving to: ‘GSE164073_raw_counts_GRCh38.p13_NCBI.tsv.gz’
0K .......... .......... .......... .......... .......... 14% 6.66M 0s
50K .......... .......... .......... .......... .......... 29% 16.9M 0s
100K .......... .......... .......... .......... .......... 43% 27.5M 0s
150K .......... .......... .......... .......... .......... 58% 10.1M 0s
200K .......... .......... .......... .......... .......... 73% 17.2M 0s
250K .......... .......... .......... .......... .......... 87% 37.6M 0s
300K .......... .......... .......... .......... . 100% 10.5M=0.02s
2024-08-25 21:08:04 (13.4 MB/s) - ‘GSE164073_raw_counts_GRCh38.p13_NCBI.tsv.gz’ saved [349584/349584]
%%bash
zcat GSE164073_raw_counts_GRCh38.p13_NCBI.tsv.gz |head
GeneID GSM2740270 GSM2740272 GSM2740273 GSM2740274 GSM2740275
100287102 9 17 14 14 19
653635 336 470 467 310 370
102466751 8 56 46 31 31
107985730 0 2 2 3 3
100302278 0 1 0 0 2
645520 0 3 8 4 7
79501 0 2 2 1 4
100996442 16 25 34 20 28
729737 19 39 33 22 26
%%bash
zcat GSE164073_raw_counts_GRCh38.p13_NCBI.tsv.gz |cut -f1,2,3 |head
GeneID GSM2740270 GSM2740272
100287102 9 17
653635 336 470
102466751 8 56
107985730 0 2
100302278 0 1
645520 0 3
79501 0 2
100996442 16 25
729737 19 39
Row filtering
%%bash
grep '>' ~/scratch/bioi611/reference/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa
>I dna:chromosome chromosome:WBcel235:I:1:15072434:1 REF
>II dna:chromosome chromosome:WBcel235:II:1:15279421:1 REF
>III dna:chromosome chromosome:WBcel235:III:1:13783801:1 REF
>IV dna:chromosome chromosome:WBcel235:IV:1:17493829:1 REF
>V dna:chromosome chromosome:WBcel235:V:1:20924180:1 REF
>X dna:chromosome chromosome:WBcel235:X:1:17718942:1 REF
>MtDNA dna:chromosome chromosome:WBcel235:MtDNA:1:13794:1 REF
%%bash
zcat GSE164073_raw_counts_GRCh38.p13_NCBI.tsv.gz |wc -l
zcat GSE164073_raw_counts_GRCh38.p13_NCBI.tsv.gz |awk '$2>500' |wc -l
zcat GSE164073_raw_counts_GRCh38.p13_NCBI.tsv.gz |awk '$2>500 && $3>500' |wc -l
39377
8773
3820
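One detail worth checking in filters like these is the header row. Its second field is text rather than a number, so awk compares it as a string and the header can slip through the filter and be counted. The sketch below illustrates this on a made-up three-gene table and uses NR>1 to skip the first line.

```shell
# Toy tab-separated count table with a header (values are made up)
counts="$(mktemp)"
printf 'GeneID\tS1\tS2\n' > "$counts"
printf 'g1\t600\t700\n' >> "$counts"
printf 'g2\t900\t100\n' >> "$counts"
printf 'g3\t10\t20\n' >> "$counts"

# Without NR>1 the header passes the test, because "S1" is compared as text
with_header="$(awk '$2>500' "$counts" | wc -l)"

# NR>1 skips the first line, so only data rows are tested
without_header="$(awk 'NR>1 && $2>500' "$counts" | wc -l)"

# Rows where both samples exceed the threshold
both="$(awk 'NR>1 && $2>500 && $3>500' "$counts" | wc -l)"

echo "$with_header $without_header $both"
```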
Text processing
%%bash
grep '>' ~/scratch/bioi611/reference/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa |sed 's/>//' |sed 's/ .*//'
I
II
III
IV
V
X
MtDNA
Regular Expressions
Regular expressions are sequences of characters that define search patterns. They are commonly used for string matching, searching, and text processing.
Regex is used in text editors, programming languages, command-line tools (like grep and sed), and many bioinformatics tools to search, replace, or extract data from text.
- Metacharacters: Special characters that have specific meanings in regex syntax.
  - . (dot): Matches any single character except a newline. Example: A.G matches "AAG", "ATG", "ACG", etc.
  - ^: Matches the start of a line. Example: ^A matches any line starting with "A".
  - $: Matches the end of a line. Example: end$ matches any line ending with "end".
  - *: Matches 0 or more occurrences of the preceding character. Example: ca*t matches "ct", "cat", "caat", "caaat", etc.
  - +: Matches 1 or more occurrences of the preceding character. Example: ca+t matches "cat", "caat", "caaat", etc.
  - ?: Matches 0 or 1 occurrence of the preceding character. Example: colou?r matches both "color" and "colour".
  - []: Matches any one of the characters inside the brackets. Example: [aeiou] matches any vowel.
  - |: Alternation (OR) operator. Example: cat|dog matches either "cat" or "dog".
- Character Classes: Represent a set of characters.
  - \d: Matches any digit (equivalent to [0-9]).
  - \w: Matches any word character (alphanumeric or underscore).
  - \s: Matches any whitespace character (spaces, tabs, etc.).
  - \D: Matches any non-digit character.
  - \W: Matches any non-word character.
  - \S: Matches any non-whitespace character.
- Quantifiers: Specify the number of occurrences to match.
  - {n}: Matches exactly n occurrences. Example: A{3} matches "AAA".
  - {n,}: Matches n or more occurrences. Example: T{2,} matches "TT", "TTT", "TTTT", etc.
  - {n,m}: Matches between n and m occurrences. Example: G{1,3} matches "G", "GG", or "GGG".
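A few of these patterns can be tried with grep -E (extended regular expressions). The sketch below runs them on four made-up DNA strings and counts the matching lines. Note that grep -E does not understand the \d shorthand, so [0-9] is the portable way to match a digit there.

```shell
# Four toy sequences, one per line (illustrative data)
seqs="$(mktemp)"
printf 'ATGAAATAG\nCCCGGG\nATGTTTT\nGGATCC\n' > "$seqs"

# ^ anchors the match to the start of the line
starts_atg="$(grep -c -E '^ATG' "$seqs")"

# | gives alternatives; $ anchors the match to the end of the line
ends_stop="$(grep -c -E '(TAA|TAG|TGA)$' "$seqs")"

# {3,} means three or more of the preceding character
run_t="$(grep -c -E 'T{3,}' "$seqs")"

# . matches any single character
a_any_g="$(grep -c -E 'A.G' "$seqs")"

echo "$starts_atg $ends_stop $run_t $a_any_g"
```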
An example command line that uses regular expressions in sed to count the genes of each biotype in a GTF annotation file:
%%bash
grep -v '#' ~/scratch/bioi611/reference/Caenorhabditis_elegans.WBcel235.111.gtf \
|awk '$3=="gene"' \
|sed 's/.*gene_biotype "//' \
|sed 's/";//'|sort |uniq -c \
| sort -k1,1n
22 rRNA
100 antisense_RNA
129 snRNA
194 lincRNA
261 miRNA
346 snoRNA
634 tRNA
2128 pseudogene
7764 ncRNA
15363 piRNA
19985 protein_coding
Environment variables
Environment variables are dynamic values that affect the behavior of processes and programs in Linux. They are commonly used to store configuration data and are essential in bioinformatics workflows for defining paths to software, libraries, and datasets.
Commonly Used Environment Variables:
PATH:
The PATH variable specifies the directories where the system looks for executable files when a command is run.
%%bash
echo $PATH
/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/compiler/linux-rhel8-zen2/gcc/11.3.0/texlive/bin/x86_64-linux:/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/compiler/linux-rhel8-zen2/gcc/11.3.0/imagemagick/bin:/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/compiler/linux-rhel8-zen2/gcc/11.3.0/graphviz/bin:/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/compiler/linux-rhel8-zen2/gcc/11.3.0/ghostscript/bin:/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/compiler/linux-rhel8-zen2/gcc/11.3.0/ffmpeg/bin:/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/mpi-nocuda/linux-rhel8-zen2/gcc/11.3.0/bin:/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/nompi-nocuda/linux-rhel8-zen2/gcc/11.3.0/bin:/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/compiler/linux-rhel8-zen2/gcc/11.3.0/bin:/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/linux-rhel8-x86_64/gcc-rh8-8.5.0/gcc-11.3.0-oedkmii7vhd6rbnqm6xufmg7d3jx4w6l/bin:/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/linux-rhel8-zen2/gcc-11.3.0/py-jupyter-1.0.0-trwwgzwljql55mhmaygcuxb3nvaevjsu/bin:/software/acigs-utilities/bin:/home/xie186/miniforge3/bin:/home/xie186/miniforge3/condabin:/home/xie186/SHELL.bioi611/software/STAR_2.7.11b/Linux_x86_64_static:/home/xie186/.local/bin:/home/xie186/bin:/software/acigs-utilities/bin:/usr/share/Modules/bin:/usr/lib/heimdal/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/symas/bin:/opt/dell/srvadmin/bin
HOME:
The HOME variable stores the path to the user’s home directory.
%%bash
echo $HOME
/home/xie186
%%bash
echo $SHELL
/bin/bash
Setting Environment Variables:
Temporarily setting a variable (valid only for the current shell session):
export PATH=value:$PATH
Permanently setting a variable:
To make the environment variable persistent across sessions, add it to a configuration file such as .bashrc or .bash_profile.
For example, add an export line like the one above to .bashrc.
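As a sketch of what such a line could look like, the snippet below appends an export line to a scratch file that stands in for .bashrc and then loads it. The directory /opt/mytools/bin is hypothetical; substitute the folder that actually holds your software.

```shell
# Hypothetical directory that holds extra tools; replace it with a real one
TOOLS_DIR="/opt/mytools/bin"

# A scratch file stands in for ~/.bashrc so the sketch is safe to run
rc_file="$(mktemp)"

# The line to store: the new directory first, then the existing $PATH.
# Single quotes keep $PATH unexpanded until the file is read.
printf 'export PATH="%s:$PATH"\n' "$TOOLS_DIR" >> "$rc_file"
cat "$rc_file"

# Load the file into the current shell (bash reads ~/.bashrc for new interactive shells)
. "$rc_file"
first_entry="$(echo "$PATH" | cut -d: -f1)"
echo "$first_entry"
```

Putting the new directory before $PATH means its programs are found first when two directories contain a command with the same name.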
Software installation
Installation via Conda
Conda is a popular package management system, especially in bioinformatics, due to its ability to create isolated environments. This is crucial when working with tools that have conflicting dependencies.
- Install conda/miniforge
Go to https://github.com/conda-forge/miniforge/releases and download the installation file that matches your system:
%%bash
uname -m
x86_64
wget https://github.com/conda-forge/miniforge/releases/download/24.7.1-0/Mambaforge-24.7.1-0-Linux-x86_64.sh
- Create conda environment and install software
conda create -n bioi611
conda activate bioi611
conda install bioconda::fastqc==0.11.8
Installation via Source Code (Manual Compilation)
git clone https://github.com/lh3/bwa.git
cd bwa; make
./bwa index ref.fa
Using Container for Bioinformatics Tools
https://hub.docker.com/r/biocontainers/bwa/
module load singularity
singularity build bwa_v0.7.17_cv1.sif docker://biocontainers/bwa:v0.7.17_cv1
Text editor in Linux
In Linux, we sometimes need to create or edit a text file, for example when writing a new Perl script. To do this, we need a text editor.
Beginners may prefer a basic, GUI-based text editor with menus and traditional CUA key bindings. Here we recommend Sublime, ATOM and Notepad++.
But a GUI-based text editor is not always available in Linux.
A powerful screen text editor, vi (pronounced “vee-eye”), is available on nearly all Linux systems. We highly recommend vi as a text editor, because sometimes we will have to edit a text file on a system without a friendlier text editor. Once we get familiar with vi, we will find that it is very fast and powerful.
But remember, it is OK if you think this part is too difficult at the beginning. You can use Sublime, ATOM or Notepad++. If you are connecting to a Linux system without Sublime, ATOM or Notepad++, you can write the file on a local computer and then upload it to the Linux system.
Basic vi skills
As vi uses many keystroke combinations, it may not be easy for beginners to remember them all in one fell swoop. Considering this, we will first introduce the basic skills needed to use vi. We first need to understand how the three modes of vi work and then remember a few basic vi commands. Then we can use these skills to write Perl or R scripts in the following chapters for Perl and R (Figure \@ref(fig:workingModeVi)).
Three modes of vi:
Create new text file with vi
mkdir test_vi ## generate a new folder
cd test_vi ## go into the new folder
echo "Using \`ls\` we don't expect files in this folder."
ls
echo "No file displayed!"
Using the code above, we made a new directory named test_vi. It does not contain any files.
If we type vi test.py, an empty file and screen are created, into which you may enter text, because the file does not exist yet (Figure \@ref(fig:ViNewFile)).
vi test.py
A screenshot of vi test.py.
Now you are in Command mode. To go to Input mode, you can type i, a or o (Figure \@ref(fig:ViInpuMode)).
A screenshot of vi test.py.
Now you can type the content (code or other information) (Figure \@ref(fig:ViInpuType)).
Once you are done typing, you need to return to Command mode (Figure \@ref(fig:workingModeVi)) if you want to save and exit the file. To do this, press the ESC key on the keyboard, then type :wq and press ENTER to save the file and quit.
Now we have written a Python script. We can run this script.
python test.py
High-Performance Computing (HPC) for Bioinformatics
HPC resources enable bioinformatics analyses that require significant computational power and memory.
Basics of HPC clusters and job schedulers (SLURM).
An example of a job file (s1_star.sh):
#!/bin/bash
#SBATCH --partition=standard
#SBATCH -t 40:00:00
#SBATCH -n 1
#SBATCH -c 20
#SBATCH --job-name=s1_star_aln
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --error=%x-%J-%u.err
#SBATCH --output=%x-%J-%u.out
conda activate bioi611
mkdir -p STAR_align/
STAR --genomeDir STAR_ref \
--outSAMtype BAM SortedByCoordinate \
--twopassMode Basic \
--quantMode TranscriptomeSAM GeneCounts \
--readFilesCommand zcat \
--outFileNamePrefix STAR_align/N2_day1_rep1. \
--runThreadN 20 \
--readFilesIn raw_data/N2_day1_rep1.fastq.gz
To submit this job, run:
sbatch s1_star.sh
Check quota information
%%bash
scratch_quota
# shell_quota
# Group quotas
Group name Space used Space quota % quota used
zt-bioi611 285.811 MB 4.000 TB 0.01%
zt-bioi611_mgr 98.163 GB unlimited 0
total 98.449 GB unlimited 0
# User quotas
User name Space used Space quota % quota used % of GrpTotal
xie186 98.449 GB unlimited 0 100.00%
View information about Slurm nodes and partitions.
%%bash
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug up 15:00 1 maint compute-b8-60
debug up 15:00 1 drng compute-b8-57
debug up 15:00 1 mix compute-b8-59
debug up 15:00 1 alloc compute-b8-58
scavenger up 14-00:00:0 1 inval compute-b8-48
scavenger up 14-00:00:0 4 drain$ compute-b8-[53-56]
scavenger up 14-00:00:0 84 maint compute-a7-[5,9,14-16,28,49],compute-a8-[2-4,8-9,15,18,22,24,29,37,44,51],compute-b5-[4,16,26,29-30,33,44,51-52],compute-b6-[7,12,21,28-29,32,34,43-46,50-51,59],compute-b7-[12-13,19-22,25,27,29,31,35,37,39,42,45-46,49-50,54,56-59],compute-b8-[16,19,21,23-24,29,32,35-37,39-45,60]
scavenger up 14-00:00:0 2 drain* compute-a7-[13,43]
scavenger up 14-00:00:0 13 drng compute-a8-[7,14],compute-b7-[14-15,18,38,43-44],compute-b8-[2,20,51,57],gpu-b9-5
scavenger up 14-00:00:0 2 drain compute-a7-8,gpu-b10-5
scavenger up 14-00:00:0 182 mix bigmem-a9-[1-2,4-5],compute-a5-[3-11],compute-a7-[2-3,6-7,10,12,17-19,21-22,30,38-40,45-46,48,54-56,60],compute-a8-[5-6,10-12,16-17,19-21,25,28,31-35,39,41,45,47,50,52,54,57-59],compute-b5-[1-3,5-8,11,13-15,17-25,27-28,31-32,34-43,45-50,53-55,57-58],compute-b6-[1-5,14-15,17-20,22-24,35-36,48-49,52,54],compute-b7-[1,7-8,16-17,23-24,26,28,30,32-34,36,40-41,47-48,51-52,55,60],compute-b8-[1,15,17-18,22,25-27,30-31,33,46-47,49-50,59],gpu-b9-[1-4,6-7],gpu-b10-[1-3,6-7],gpu-b11-[1-6]
scavenger up 14-00:00:0 93 alloc bigmem-a9-[3,6],compute-a7-[1,4,11,20,23-27,29,31-37,41-42,44,47,50-53,57-59],compute-a8-[1,13,23,26-27,30,36,38,40,42-43,46,48-49,53,55-56,60],compute-b5-[9-10,12,56,59-60],compute-b6-[6,8-11,13,16,27,30-31,58,60],compute-b7-[2-6,9-11,53],compute-b8-[3-14,28,34,38,52,58],gpu-b10-4
scavenger up 14-00:00:0 14 idle compute-b6-[25-26,33,37-42,47,53,55-57]
standard* up 7-00:00:00 1 inval compute-b8-48
standard* up 7-00:00:00 4 drain$ compute-b8-[53-56]
standard* up 7-00:00:00 82 maint compute-a7-[5,9,14-16,28,49],compute-a8-[2-4,8-9,15,18,22,24,29,37,44,51],compute-b5-[4,16,26,29-30,33,44,51-52],compute-b6-[7,12,21,28-29,32,34,43-46,50-51],compute-b7-[12-13,19-22,25,27,29,31,35,37,39,42,45-46,49-50,54,56-59],compute-b8-[16,19,21,23-24,29,32,35-37,39-45]
standard* up 7-00:00:00 2 drain* compute-a7-[13,43]
standard* up 7-00:00:00 11 drng compute-a8-[7,14],compute-b7-[14-15,18,38,43-44],compute-b8-[2,20,51]
standard* up 7-00:00:00 1 drain compute-a7-8
standard* up 7-00:00:00 159 mix compute-a5-[3-11],compute-a7-[2-3,6-7,10,12,17-19,21-22,30,38-40,45-46,48,54-56,60],compute-a8-[5-6,10-12,16-17,19-21,25,28,31-35,39,41,45,47,50,52,54,57-59],compute-b5-[1-3,5-8,11,13-15,17-25,27-28,31-32,34-43,45-50,53-55,57-58],compute-b6-[1-5,14-15,17-20,22-24,35-36,48-49,52],compute-b7-[1,7-8,16-17,23-24,26,28,30,32-34,36,40-41,47-48,51-52,55,60],compute-b8-[1,15,17-18,22,25-27,30-31,33,46-47,49-50]
standard* up 7-00:00:00 87 alloc compute-a7-[1,4,11,20,23-27,29,31-37,41-42,44,47,50-53,57-59],compute-a8-[1,13,23,26-27,30,36,38,40,42-43,46,48-49,53,55-56,60],compute-b5-[9-10,12,56,59-60],compute-b6-[6,8-11,13,16,27,30-31],compute-b7-[2-6,9-11,53],compute-b8-[3-14,28,34,38,52]
standard* up 7-00:00:00 10 idle compute-b6-[25-26,33,37-42,47]
serial up 14-00:00:0 1 maint compute-b6-59
serial up 14-00:00:0 1 mix compute-b6-54
serial up 14-00:00:0 2 alloc compute-b6-[58,60]
serial up 14-00:00:0 4 idle compute-b6-[53,55-57]
gpu up 7-00:00:00 1 down$ gpu-a6-3
gpu up 7-00:00:00 1 drng gpu-b9-5
gpu up 7-00:00:00 1 drain gpu-b10-5
gpu up 7-00:00:00 19 mix gpu-a6-[6,8],gpu-b9-[1-4,6-7],gpu-b10-[1-3,6-7],gpu-b11-[1-6]
gpu up 7-00:00:00 1 alloc gpu-b10-4
gpu up 7-00:00:00 6 idle gpu-a5-1,gpu-a6-[2,4-5,7,9]
bigmem up 7-00:00:00 4 mix bigmem-a9-[1-2,4-5]
bigmem up 7-00:00:00 2 alloc bigmem-a9-[3,6]
Check partition information
%%bash
scontrol show partition standard
PartitionName=standard
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=00:15:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=compute-a5-[3-11],compute-a7-[1-60],compute-a8-[1-60],compute-b5-[1-60],compute-b6-[1-52],compute-b7-[1-60],compute-b8-[1-56]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=YES:4
OverTimeLimit=NONE PreemptMode=REQUEUE
State=UP TotalCPUs=45696 TotalNodes=357 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=45696,mem=178500G,node=357,billing=45696
TRESBillingWeights=CPU=1.0,Mem=0.25G
Display node config information
%%bash
scontrol show node compute-a5-3
NodeName=compute-a5-3 Arch=x86_64 CoresPerSocket=64
CPUAlloc=71 CPUEfctv=128 CPUTot=128 CPULoad=68.89
AvailableFeatures=rhel8,amd,epyc_7702,ib
ActiveFeatures=rhel8,amd,epyc_7702,ib
Gres=(null)
NodeAddr=compute-a5-3 NodeHostName=compute-a5-3 Version=23.11.9
OS=Linux 4.18.0-553.5.1.el8_10.x86_64 #1 SMP Tue May 21 03:13:04 EDT 2024
RealMemory=512000 AllocMem=296960 FreeMem=326630 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=300000 Weight=1 Owner=N/A MCS_label=N/A
Partitions=scavenger,standard
BootTime=2024-08-08T18:32:48 SlurmdStartTime=2024-08-12T17:43:23
LastBusyTime=2024-08-12T17:43:19 ResumeAfterTime=None
CfgTRES=cpu=128,mem=500G,billing=128
AllocTRES=cpu=71,mem=290G
CapWatts=n/a
CurrentWatts=630 AveWatts=294
ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
CPU Details:
- Total CPUs: 128
- Allocated CPUs: 71
Memory:
- Total Memory: 500 GB
- Allocated Memory: 290 GB
- Free Memory: ~319 GB
View information about jobs located in the Slurm scheduling queue.
%%bash
squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
7563417 standard sys/dash xie186 R 48:15 1 compute-a5-5
Cancel a job
%%bash
scancel <JOBID>