The most useful computer science class you’ve probably never taken
One thing that I have consistently observed throughout my career is that the most productive data scientists and engineers usually have one thing in common: they’re command-line wizards. They can navigate their computer’s file system, search for patterns in log files, and manage jobs, source code, and version control all from the command line, without relying on slow navigation with the mouse and graphical user interfaces.
Yet, this command-line ‘wizardry’, as it may appear to someone unfamiliar with shell tools, is not typically part of standard computer science curricula. An MIT course around mastering your command line is aptly named “The Missing Semester of Your CS Education”.
This post is my personal, 10-lesson ‘command-line wizardry 101’ class, aimed at readers who want to work more with the command line and less with graphical user interfaces. We’ll cover the basics of the shell and the path variable, aliases, file permissions, streaming and piping, efficient job management, tmux, ssh, git, and vim.
Let’s get started. Welcome to CLW 101.
1. The shell
When you open your terminal, you’re looking at a shell, such as bash (Bourne Again Shell) or zsh (Z shell). The shell really is a complete programming language with access to certain standard programs that allow for file system navigation and data manipulation. You can find out which shell you’re running by typing:
echo $SHELL
In bash, each time you start a new shell, the shell loads a sequence of commands that are specified inside the .bashrc file, which is typically in your home directory (if you use a Mac, there’s usually a .bash_profile file instead, and zsh reads .zshrc). In that file you can specify useful things such as your path variable or aliases (more on which below).
2. The path variable
When you type the name of certain programs into your shell, such as python, cat, or ls, how does the shell know where to get that program from? That’s the purpose of the path variable. This variable stores a list of all paths where the shell looks for programs, separated by colons. You can inspect your path variable by typing:
echo $PATH
And you can add additional directories to your path variable with this command:
export PATH="my_new_path:$PATH"
It’s best to add this command to your .bashrc file, so that your additional directory is always in your path when you start a new shell.
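To see the path lookup in action, here is a minimal sketch that uses a throwaway directory instead of your real .bashrc. The directory and the tool name hello_clw are made up for illustration, and chmod is explained in Lesson 4.

```shell
# Create a temporary directory holding one tiny program (hypothetical name: hello_clw).
bindir=$(mktemp -d)
cat > "$bindir/hello_clw" <<'EOF'
#!/bin/sh
echo "hello from my own tool"
EOF
chmod u+x "$bindir/hello_clw"      # make it executable (see Lesson 4)

# Prepend the directory to the path, exactly as in the export command above.
export PATH="$bindir:$PATH"

# The shell now finds the program by name alone.
found_at=$(command -v hello_clw)   # prints the location the shell resolved
greeting=$(hello_clw)
echo "$found_at"
echo "$greeting"
```

Because the new directory comes first in the list, a program there would also shadow any program with the same name further down the path.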
3. Aliases
Aliases are custom commands that you can define in order to avoid typing lengthy commands over and over again, such as:
alias ll="ls -lah"
alias gs="git status"
alias gp="git push origin master"
Aliases can also be used to create safeguards for your development workflow. For example, by defining
alias mv="mv -i"
your terminal will ask for confirmation if a file with the same name already exists at the destination, so that you don’t accidentally overwrite files that you didn’t mean to overwrite.
Once you add these aliases to your .bashrc file, they’re always available when you start a new shell.
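If you want to check what an alias expands to, the alias command followed by the alias name prints its definition. The sketch below assumes bash is installed and uses a temporary file as a stand-in for your real .bashrc, so nothing in your own configuration is changed.

```shell
# A temporary file standing in for ~/.bashrc (assumption: we don't touch the real one).
demo_rc=$(mktemp)
cat > "$demo_rc" <<'EOF'
alias ll="ls -lah"
alias gs="git status"
EOF

# Load the file in a fresh bash and ask for the definition of ll.
ll_def=$(bash -c 'source "$1"; alias ll' _ "$demo_rc")
echo "$ll_def"
```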
4. File permissions and sudo
When multiple users share a machine, it’s important to set file permissions that determine which user can perform which operations on what data. When you type ls -l, you’ll see the files in your current directory along with their permissions in the following form:
-rwxrwxrwx
Here, rwx stands for read, write, and execute rights, respectively. The 3 rwx blocks are for (1) the user, (2) the user group, and (3) everyone else. In the given example, all 3 of these entities have read, write, as well as execute permissions. The leading dash indicates that this is a file. Instead of the dash, you can also see a d for a directory or an l for a symbolic link.
You can edit file permissions with chmod. For example, if you want to make a file executable for yourself, you’d type
chmod u+x my_program.py
👉 If a file is executable, how does the shell know how to execute it? This is specified with a ‘hashbang’ (also known as a ‘shebang’) in the first row of the file, such as #!/bin/bash for a bash script or #!/usr/bin/env python3 for a python script.
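The following sketch puts the permission string, chmod, and the hashbang together. It works in a temporary directory, and the script name my_program.sh is only an illustration.

```shell
# Write a two-line script whose first row is the hashbang.
workdir=$(mktemp -d)
script="$workdir/my_program.sh"
printf '#!/bin/sh\necho "it runs"\n' > "$script"

# Start from read/write for the user and read-only for everyone else.
chmod 644 "$script"
before=$(ls -l "$script" | cut -c1-10)   # first 10 characters: type + 3 rwx blocks

# Add execute rights for the user only.
chmod u+x "$script"
after=$(ls -l "$script" | cut -c1-10)

# Now the file can be executed directly; the hashbang selects /bin/sh to run it.
output=$("$script")
echo "$before -> $after : $output"
```

Only the first rwx block changes, because u+x touches the user’s rights and leaves the group and everyone else alone.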
Lastly, there’s a special ‘super user’ who has all of the permissions for all of the files. You can run any command as that super user by writing sudo in front of that command. You can also launch a stand-alone sudo shell by executing
sudo su
⚠️ Use sudo with care. With sudo, you’re able to make changes to the code that controls your computer’s hardware, and a mistake could make your machine unusable. Remember, with great power comes great responsibility.
5. Streaming and piping
The streaming operator > redirects the output from a program to a file, overwriting the file if it already exists. >> does the same thing, but appends to an existing file instead of overwriting it. This is useful for logging your own programs like this:
python my_program.py > logfile
Another useful concept is piping: x | y executes program x, and then directs the output from x into program y. For example:
- cat log.txt | tail -n5 : prints the last 5 lines from log.txt
- cat log.txt | head -n5 : prints the first 5 lines from log.txt
- cat -b log.txt | grep error : shows all lines in log.txt that contain the string ‘error’, along with the line number (-b, which numbers the non-blank lines)
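Here is a small self-contained sketch of both ideas. The log lines are invented for the example, and the log lives in a temporary file.

```shell
# Build a small log: > creates (or overwrites) the file, >> appends to it.
log=$(mktemp)
echo "start job" > "$log"
echo "error: disk full" >> "$log"
echo "job finished" >> "$log"
echo "error: retry failed" >> "$log"

# Pipe the log through grep and count the matching lines.
# tr strips the padding that some versions of wc add around the number.
error_count=$(cat "$log" | grep error | wc -l | tr -d ' ')

# Pipe the log through tail to get only the last line.
last_line=$(cat "$log" | tail -n1)

echo "errors: $error_count"
echo "last line: $last_line"
```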
6. Managing jobs
If you run a program from your command line (e.g. python run.py), the program will by default run in the foreground, and prevent you from doing anything else until the program is done. While the program is running in the foreground, you can:
- type control+C, which will send a SIGINT (signal interrupt) signal to the program, which interrupts the program immediately (unless the program has a way to handle these signals internally).
- type control+Z, which will pause the program. After pausing, the program can be continued either by bringing it to the foreground (fg), or by sending it to the background (bg).
In order to start your command in the background right away, you use the & operator:
python run.py &
👉 How do you know which programs are currently running in the background? Use the command jobs. This will display the job numbers, states, and names of the jobs running. Use jobs -l to also show their process ids (PIDs).
Lastly, kill is a program to send signals to programs running in the background. For example,
- kill -STOP %1 sends a STOP signal, pausing program 1.
- kill -KILL %1 sends a KILL signal, terminating program 1 permanently.
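Job numbers such as %1 are a feature of interactive shells. Inside a script, a rough equivalent is to address the process by its PID, which the shell stores in $! right after a background launch. The sketch below uses sleep as a stand-in for a long-running program.

```shell
# Start a long-running stand-in program in the background.
sleep 60 &
pid=$!                      # PID of the most recent background process

kill -STOP "$pid"           # pause it
kill -CONT "$pid"           # let it continue
kill -KILL "$pid"           # terminate it for good

# wait reports how the process ended; 128 + 9 (the KILL signal number) = 137.
status=0
wait "$pid" || status=$?
echo "exit status: $status"
```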
7. tmux
tmux (‘terminal multiplexer’) enables you to easily create new terminals and navigate between them. This can be extremely useful: for example, you can use one terminal to navigate your file system and another terminal to execute jobs. With tmux, you can even have both of these side-by-side.
👉 Another reason to learn tmux is remote development: when you log out of a remote machine (either on purpose or accidentally), all of the programs that were actively running inside your shell are automatically terminated. On the other hand, if you run your programs inside a tmux shell, you can simply detach the tmux session, log out, close your computer, and come back to that shell later as if you’d never been logged out.
Here are some basic commands to get you started with tmux:
- tmux new -s run : creates a new terminal session with the name ‘run’
- control-B D : detach from the current session
- tmux a : attach to the latest session
- tmux a -t run : attach to the session called ‘run’
- control-B " : add another terminal pane below
- control-B % : add another terminal pane to the right
- control-B ➡️ : move to the terminal pane to the right (similar for left, up, down)
8. SSH and key pairs
ssh is a program for logging into remote machines. In order to log into remote machines, you’ll need to provide either a username and password, or a key pair, consisting of a public key (which both machines have access to) and a private key (which only your own machine has access to).
ssh-keygen is a program for generating such a key pair. If you run ssh-keygen, it will by default create a public key named id_rsa.pub and a private key named id_rsa, and place both into your ~/.ssh directory (recent OpenSSH releases may default to a different key type, in which case the file names differ accordingly). You’ll need to add the public key to the remote machine, which, as you should know by now, you can do by piping together cat, ssh, and a streaming operator:
cat ~/.ssh/id_rsa.pub | ssh user@remote 'cat >> ~/.ssh/authorized_keys'
Now you’ll be able to ssh into remote just by providing your private key:
ssh remote -i ~/.ssh/id_rsa
An even better practice is to create a file ~/.ssh/config which contains all of your ssh authentication configurations. For example, if your config file is as follows:
Host dev
HostName remote
IdentityFile ~/.ssh/id_rsa
Then you can log into remote by simply typing ssh dev.
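If you’d like to try the key-generation step from this lesson without touching your real ~/.ssh directory, here is a cautious sketch. The file name id_demo, the comment ‘demo key’, and the choice of the ed25519 key type are assumptions made for illustration. The empty passphrase (-N "") is only acceptable for a throwaway key like this one.

```shell
# Generate a throwaway key pair in a temporary directory (only if ssh-keygen is installed).
keydir=$(mktemp -d)
if command -v ssh-keygen >/dev/null 2>&1; then
    # -q: quiet, -t: key type, -N "": empty passphrase, -C: comment, -f: output file
    ssh-keygen -q -t ed25519 -N "" -C "demo key" -f "$keydir/id_demo"

    ls -l "$keydir"             # id_demo is the private key, id_demo.pub the public key
    cat "$keydir/id_demo.pub"   # this single line is what goes into authorized_keys
fi
```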
9. git
git is a version control system that allows you to efficiently navigate your code’s versioning history and branches from the command line.
👉 Note that git is not the same as GitHub: git is a stand-alone program that can manage your code’s versioning on your local laptop, while GitHub is a place to host your code remotely.
Here are some essential git commands:
- git add : specifies which files you want to include in the next commit
- git commit -m 'my commit message' : commits the code change
- git checkout -b dev : creates a new branch named ‘dev’ and checks out that branch
- git merge dev : merges dev into the current branch. If this creates merge conflicts, you’ll need to fix these conflicts manually, and then run git add file_that_changed; git merge --continue
- git stash : shelves all uncommitted changes and gives you a clean working directory, and git stash pop brings them back. This is useful if you made changes to the master branch, and then decide that you actually want those changes to be on a separate branch.
- git reset --hard : discards all uncommitted changes to tracked files permanently
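The commands above can be tried end to end in a throwaway repository. This is a minimal sketch, assuming git is installed. The file name notes.txt, the branch name feature, and the user identity are placeholders.

```shell
# Throwaway repository, so nothing in your real projects is touched.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name "Demo User"          # placeholder identity
git -C "$repo" config user.email "demo@example.com"
base=$(git -C "$repo" symbolic-ref --short HEAD)     # master or main, depending on your setup

# First commit on the base branch.
echo "v1" > "$repo/notes.txt"
git -C "$repo" add notes.txt
git -C "$repo" commit -q -m 'first commit'

# Second commit on a new dev branch, then merge dev back into the base branch.
git -C "$repo" checkout -q -b dev
echo "v2" >> "$repo/notes.txt"
git -C "$repo" add notes.txt
git -C "$repo" commit -q -m 'update notes on dev'
git -C "$repo" checkout -q "$base"
git -C "$repo" merge -q dev
commit_count=$(git -C "$repo" rev-list --count HEAD)

# Stash workflow: move an uncommitted change from the base branch to a new branch.
echo "draft" >> "$repo/notes.txt"
git -C "$repo" stash -q
git -C "$repo" checkout -q -b feature
git -C "$repo" stash pop -q
current_branch=$(git -C "$repo" symbolic-ref --short HEAD)
pending=$(git -C "$repo" status --porcelain)          # non-empty: the draft change is back

echo "$commit_count commits, now on $current_branch, pending: $pending"
```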
And here are some essential git commands for dealing with a remote host (e.g. GitHub):
- git clone : clones a copy of the code repo to your local machine
- git push origin master : pushes the changes to the remote host (e.g. GitHub)
- git pull : pulls the latest version from remote. (This is the same as running git fetch; git merge.)
👉 Before being able to run a command such as git push origin master, you’ll need to authenticate with an ssh keypair (see Lesson 8). If you use GitHub, you can simply paste the public key under your profile settings.
10. vim
Vim is a powerful command-line based text editor. It’s a good idea to learn at least the very basic commands in vim:
- every once in a while you may have to log into a remote machine and make a code change there. vim is a standard program and therefore usually available on any machine you work on.
- when running git commit, by default git opens vim for writing a commit message. So at the very least you’ll want to know how to write, save, and close a file.
The most important thing to understand about vim is that there are different operation modes. Once you launch vim, you’re inside navigation mode (vim itself calls this normal mode), which you use to navigate through the file. Type i to start edit mode (insert mode), in which you can make changes to the file. Press the Esc key to leave edit mode and go back to navigation mode.
The useful thing about navigation mode is that you’re able to rapidly navigate and manipulate the file with your keyboard, for example:
- x deletes a character
- dd deletes an entire row
- b (back) goes to the previous word, w (word) goes to the next word
- :wq saves your changes and closes the file
- :q! ignores your changes and closes the file
For more (much more!) vim keyboard shortcuts, check out this vim cheatsheet.
Final thoughts
Congratulations, you’ve completed ‘command line wizardry 101’. However, we’ve only scratched the surface here. For inspiration, consider the following problem:
“Given a text file and an integer k, print the k most common words in the file (and the number of their occurrences) in decreasing frequency.”
As a data scientist, my first impulse may be to launch a jupyter notebook, load the data perhaps into pandas, and then use a function such as pandas agg. However, to a seasoned command-line wizard, this is a one-liner:
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
This doesn’t look too different from Stable Diffusion’s imagination shown in the beginning of this article. Wizardry, indeed.
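To unpack the one-liner: the pipeline reads the text from standard input, and ${1} is the first argument (the integer k) of the script or function it lives in. Here is a sketch that wraps it in a function and runs it on a made-up sentence. The function name top_words and the sample text are assumptions for illustration.

```shell
# top_words K : reads text on standard input, prints the K most common words.
top_words() {
    tr -cs A-Za-z '\n' |   # turn every run of non-letters into a single newline: one word per line
    tr A-Z a-z |           # lowercase everything
    sort |                 # bring identical words next to each other
    uniq -c |              # collapse duplicates, prefixing each word with its count
    sort -rn |             # sort by count, highest first
    sed "${1}q"            # print the first K lines, then quit
}

sample=$(mktemp)
printf 'the cat and the dog and the bird\n' > "$sample"

# awk only normalizes the padding that uniq -c puts in front of the counts.
top2=$(top_words 2 < "$sample" | awk '{print $1, $2}')
echo "$top2"
```

For this sample, the two most common words are ‘the’ (3 occurrences) and ‘and’ (2 occurrences).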
How to become a command-line wizard Republished from Source https://towardsdatascience.com/how-to-become-a-command-line-wizard-5d78d75fbf0c?source=rss—-7f60cf5620c9—4 via https://towardsdatascience.com/feed