Sed Usage Notes

Posted by wsxq2 on 2019-04-20
TAGS:  sed

Last edited: 2019-10-19 15:26:12 +0800

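This post is mostly extracted from man sed and the "Sed 使用参考手册" (a sed reference manual). Running info sed com gives content close to this post. For more detailed and complete sed documentation, see info sed.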

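Sed is a stream editor used to filter and substitute text. It reads one line at a time, applies the given commands to that line, and writes the result, so it copes well with large files.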

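Sed works as follows. First, it reads input from files or from a pipe. By default sed does not modify the source file; it copies each line it reads into a buffer called the pattern space, and all commands operate on the pattern space. Sed then processes the pattern space according to the commands and writes the result, by default to standard output (i.e. the screen).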
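The workflow above can be observed with a minimal sketch. The temporary file and its contents are made up for illustration; the point is that sed prints the edited text to standard output while the file on disk stays unchanged.

```shell
# Create a small throwaway file (hypothetical content).
tmp=$(mktemp)
printf 'alpha\nbeta\n' > "$tmp"

# sed edits a copy of each line in the pattern space and prints the result.
edited=$(sed 's/alpha/ALPHA/' "$tmp")

# The file itself has not been modified.
original=$(cat "$tmp")

printf 'edited:\n%s\noriginal:\n%s\n' "$edited" "$original"
rm -f "$tmp"
```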

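Note: sed makes a single pass over the input. For each input line it runs, in script order, every command whose address matches that line. The visible effect is that a command addressed to an earlier line takes effect before a command addressed to a later line, no matter which of the two appears first in the script.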
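The ordering can be checked with a small experiment. This is only a sketch with made-up input: the first command shows that output follows input order even when the script lists line 2 before line 1, and the other two show that commands applying to the same line run in the order they are written.

```shell
# The script says "2p" before "1p", yet line 1 is still printed first,
# because sed processes the input one line at a time.
by_input=$(printf 'first\nsecond\n' | sed -n -e '2p' -e '1p')

# Commands that apply to the same line run in script order,
# so swapping them changes the result.
ab=$(printf 'abc\n' | sed -e 's/a/b/' -e 's/b/c/')   # abc -> bbc -> cbc
ba=$(printf 'abc\n' | sed -e 's/b/c/' -e 's/a/b/')   # abc -> acc -> bcc

printf '%s\n%s %s\n' "$by_input" "$ab" "$ba"
```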

sed command-line options

Running sed --help prints the following help:

Usage: sed [OPTION]... {script-only-if-no-other-script} [input-file]...

  -n, --quiet, --silent
                 suppress automatic printing of pattern space
  -e script, --expression=script
                 add the script to the commands to be executed
  -f script-file, --file=script-file
                 add the contents of script-file to the commands to be executed
  --follow-symlinks
                 follow symlinks when processing in place
  -i[SUFFIX], --in-place[=SUFFIX]
                 edit files in place (makes backup if SUFFIX supplied)
  -c, --copy
                 use copy instead of rename when shuffling files in -i mode
  -b, --binary
                 does nothing; for compatibility with WIN32/CYGWIN/MSDOS/EMX (
                 open files in binary mode (CR+LFs are not treated specially))
  -l N, --line-length=N
                 specify the desired line-wrap length for the `l' command
  --posix
                 disable all GNU extensions.
  -r, --regexp-extended
                 use extended regular expressions in the script.
  -s, --separate
                 consider files as separate rather than as a single continuous
                 long stream.
  -u, --unbuffered
                 load minimal amounts of data from the input files and flush
                 the output buffers more often
  -z, --null-data
                 separate lines by NUL characters
  --help
                 display this help and exit
  --version
                 output version information and exit

If no -e, --expression, -f, or --file option is given, then the first
non-option argument is taken as the sed script to interpret.  All
remaining arguments are names of input files; if no input files are
specified, then the standard input is read.

GNU sed home page: <http://www.gnu.org/software/sed/>.
General help using GNU software: <http://www.gnu.org/gethelp/>.
E-mail bug reports to: <bug-sed@gnu.org>.
Be sure to include the word 'sed' somewhere in the 'Subject:' field.
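As a minimal sketch of two of the options listed above, the following uses -n to suppress automatic printing and -i with a suffix to edit a file in place while keeping a backup. The directory, file name, and contents are made up for illustration, and a sed that accepts -i with an attached suffix (such as GNU sed) is assumed.

```shell
# Throwaway working directory and file (hypothetical names).
dir=$(mktemp -d)
printf 'hello\nworld\n' > "$dir/demo.txt"

# -n: print nothing unless asked; here only line 2 is printed.
second=$(sed -n '2p' "$dir/demo.txt")

# -i.bak: rewrite demo.txt in place and keep the old version as demo.txt.bak.
sed -i.bak 's/hello/HELLO/' "$dir/demo.txt"
edited=$(cat "$dir/demo.txt")
backup=$(cat "$dir/demo.txt.bak")

printf 'second: %s\nedited:\n%s\nbackup:\n%s\n' "$second" "$edited" "$backup"
rm -rf "$dir"
```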

Common sed invocation forms

sed [options] 'script' file1 file2 ...
sed [options] -f scriptfile file1 file2 ...
sed 'script' file | sed 'script' | sed 'script'
sed 'statement1; statement2; statement3' file1 file2 ...
sed -e 'statement1' -e 'statement2' -e 'statement3' file1 file2 ...
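The forms above are interchangeable for simple scripts. The sketch below runs the same two substitutions in four ways; the input and the temporary script file are invented for the example.

```shell
# A script file for the -f form (hypothetical temporary file).
script=$(mktemp)
printf 's/x/y/\ns/y/z/\n' > "$script"

semi=$(printf 'x\n' | sed 's/x/y/; s/y/z/')           # statements separated by ;
multi=$(printf 'x\n' | sed -e 's/x/y/' -e 's/y/z/')   # one -e per statement
piped=$(printf 'x\n' | sed 's/x/y/' | sed 's/y/z/')   # a pipeline of sed processes
fromfile=$(printf 'x\n' | sed -f "$script")           # statements read from a file

printf '%s %s %s %s\n' "$semi" "$multi" "$piped" "$fromfile"
rm -f "$script"
```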

A statement usually looks like this:

address{
    command1
    command2
    command3
}

If there is only one command, the {} can be omitted.
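A short sketch of both cases, using made-up input: a block of two commands applied to the address range 2,3, and a single command on the same range without braces.

```shell
# Two commands grouped in a block: prefix lines 2-3 with "> " and print them.
block=$(printf 'a\nb\nc\nd\n' | sed -n '2,3{
    s/^/> /
    p
}')

# A single command needs no braces.
single=$(printf 'a\nb\nc\nd\n' | sed -n '2,3p')

printf '%s\n---\n%s\n' "$block" "$single"
```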

Zero-address commands

: label Label for b and t commands.
#comment The comment extends until the next newline (or the end of a -e script fragment).
} The closing bracket of a { } block.

Zero- or One- address commands

= Print the current line number.
a text Append text, which has each embedded newline preceded by a backslash.
i text Insert text, which has each embedded newline preceded by a backslash.
q [exit-code] Immediately quit the sed script without processing any more input, except that if auto-print is not disabled the current pattern space will be printed. The exit code argument is a GNU extension.
Q [exit-code] Immediately quit the sed script without processing any more input. This is a GNU extension.
r filename Append text read from filename.
R filename Append a line read from filename. Each invocation of the command reads a line from the file. This is a GNU extension.

Commands which accept address ranges

{ Begin a block of commands (end with a }).
b label Branch to label; if label is omitted, branch to end of script.
c text Replace the selected lines with text, which has each embedded newline preceded by a backslash.
d Delete pattern space. Start next cycle.
D If pattern space contains no newline, start a normal new cycle as if the d command was issued. Otherwise, delete text in the pattern space up to the first newline, and restart cycle with the resultant pattern space, without reading a new line of input.
h H Copy/append pattern space to hold space.
g G Copy/append hold space to pattern space.
l List out the current line in a ‘visually unambiguous’ form.
l width List out the current line in a ‘visually unambiguous’ form, breaking it at width characters. This is a GNU extension.
n N Read/append the next line of input into the pattern space.
p Print the current pattern space.
P Print up to the first embedded newline of the current pattern space.
s/regexp/replacement/ Attempt to match regexp against the pattern space. If successful, replace that portion matched with replacement. The replacement may contain the special character & to refer to that portion of the pattern space which matched, and the special escapes \1 through \9 to refer to the corresponding matching sub-expressions in the regexp.
t label If a s/// has done a successful substitution since the last input line was read and since the last t or T command, then branch to label; if label is omitted, branch to end of script.
T label If no s/// has done a successful substitution since the last input line was read and since the last t or T command, then branch to label; if label is omitted, branch to end of script. This is a GNU extension.
w filename Write the current pattern space to filename.
W filename Write the first line of the current pattern space to filename. This is a GNU extension.
x Exchange the contents of the hold and pattern spaces.
y/source/dest/ Transliterate the characters in the pattern space which appear in source to the corresponding character in dest.
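To make a few of these commands concrete, here is a small sketch with invented input. The first command is the well-known idiom for reversing the lines of a file with G, h, and p; the second uses y to map characters.

```shell
# Reverse the order of lines using the hold space:
#   1!G  on every line but the first, append the hold space to the pattern space
#   h    copy the pattern space to the hold space
#   $p   on the last line, print what has been accumulated
reversed=$(printf 'one\ntwo\nthree\n' | sed -n '1!G;h;$p')

# y/// maps characters one to one: l -> 0, o -> 1.
mapped=$(printf 'hello\n' | sed 'y/lo/01/')

printf '%s\n%s\n' "$reversed" "$mapped"
```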

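Flags for the s command

n A number between 1 and 512: replace only the n-th occurrence of the pattern in the pattern space. Useful, for example, when a line contains three A's and only the second should be replaced.
g Replace every match in the pattern space. Without g only the first match is replaced; for example, in a line with three A's only the first A is replaced.
p Print the pattern space (i.e. the line) if a substitution was made. Combined with the -n option, only the matching lines are printed.
w file Write the pattern space (i.e. the line) to the file named file if a substitution was made.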
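The flags can be compared side by side on a line containing three A's. This is a sketch with made-up input; the log file used for the w flag is a hypothetical temporary file.

```shell
line='A-A-A'
first=$(printf '%s\n' "$line" | sed 's/A/x/')     # no flag: first match only
second=$(printf '%s\n' "$line" | sed 's/A/x/2')   # numeric flag: second match only
all=$(printf '%s\n' "$line" | sed 's/A/x/g')      # g: every match

# p together with -n: print only the lines where a substitution happened.
printed=$(printf 'foo\nbar\n' | sed -n 's/foo/FOO/p')

# w file: write the substituted lines to a file (temporary file for this sketch).
log=$(mktemp)
printf 'foo\nbar\n' | sed -n "s/bar/BAR/w $log"
written=$(cat "$log")
rm -f "$log"

printf '%s %s %s %s %s\n' "$first" "$second" "$all" "$printed" "$written"
```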

Common questions

Negation

find . -type f | sed -n '/\.abc/!p'
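The effect of ! is easier to see on a fixed list of names. In this sketch the file names are invented, and the dot is escaped and the pattern anchored with $ so that only a literal .abc suffix is excluded.

```shell
# Print every name that does NOT end in .abc.
kept=$(printf 'a.abc\nb.txt\nc.abc\nd.md\n' | sed -n '/\.abc$/!p')

# Deleting the matching lines gives the same result without -n.
kept_d=$(printf 'a.abc\nb.txt\nc.abc\nd.md\n' | sed '/\.abc$/d')

printf '%s\n' "$kept"
```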

Conditionals

sed has no if..else construct in the usual sense; only in certain cases can it be made to look like one. Take the following problem:

if (sed command does find a match for "::=BEGIN")
then i=1 
else i=0

The following command can be used; it gives the appearance of a conditional:

i=$(sed ':a;N;$!ba;s/\n/ /g' foo | sed '{/::=BEGIN/{s/.*/1/; b next}; s/.*/0/; :next}')

Both the problem and the solution above come from: bash - Use sed command to apply a condition to check if a pattern has been matched - Ask Ubuntu

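However, for a problem I ran into while handling abbreviations, I could not get sed to work after several attempts, even though the problem involves only a simple conditional. Of course, it may just be that my own ability is limited and I did not find the solution. For more on that problem, see the section on handling abbreviations below.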
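For the simple found/not-found case above, another possible approach is to let sed's exit status carry the answer. This is a sketch that assumes GNU sed, since the exit-code argument of q is a GNU extension; the helper name has_begin and the sample files are made up for illustration.

```shell
# Two small sample inputs (contents are made up for this sketch).
with=$(mktemp); without=$(mktemp)
printf 'header\nFoo ::=BEGIN\nbody\n' > "$with"
printf 'header\nbody\n' > "$without"

# has_begin FILE: prints 1 if FILE contains ::=BEGIN, else 0.
# Relies on the GNU extension "q exit-code": sed exits with status 1 at the
# first match, and with status 0 if it reaches the end of input without one.
has_begin() {
    if sed -n '/::=BEGIN/q1' "$1"; then
        echo 0
    else
        echo 1
    fi
}

i_with=$(has_begin "$with")
i_without=$(has_begin "$without")

printf '%s %s\n' "$i_with" "$i_without"
rm -f "$with" "$without"
```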

Adding blank lines

Test file a

---
tags: [sed]
last_modified_time: 2019-04-21 17:59:31 +0800
---

<!-- vim-markdown-toc GFM -->

* [sed 命令行参数](#sed-命令行参数)
* [sed 常用格式](#sed-常用格式)
* [Zero-address commands](#zero-address-commands)
* [Zero- or One- address commands](#zero--or-one--address-commands)
* [Commands which accept address ranges](#commands-which-accept-address-ranges)
* [`s`命令 flag](#s命令-flag)
* [常见问题](#常见问题)
  * [取反](#取反)

The goal is to add the following before the <!-- vim-markdown-toc GFM --> line (note the blank line):

<p id="markdown-toc"></p>

i\ and a\

The following command does the job:

sed -e '/^<!-- vim-markdown-toc GFM -->$/{i\
<p id="markdown-toc"></p>\n
; }' a

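The i command inserts text before the addressed line, and the a command appends text after it. Since a and i are so similar, a is not covered separately here.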
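A minimal sketch of both commands on made-up input, using the form in which the text starts on the line after the trailing backslash:

```shell
# Insert a line before, and append a line after, the line matching /two/.
out=$(printf 'one\ntwo\nthree\n' | sed -e '/two/i\
-- before --' -e '/two/a\
-- after --')

printf '%s\n' "$out"
```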

{x;p;x} and G

sed -e '/^<!-- vim-markdown-toc GFM -->$/{i\
<p id="markdown-toc"></p>
; {x;p;x}}' a

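{x;p;x} inserts a blank line before the addressed line (provided the hold space is empty), and G appends a blank line after it. Since the two are used in the same way, G is likewise not shown separately.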
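Both forms can be tried on a three-line input. This sketch uses made-up input and assumes nothing else in the script has written to the hold space, so it is still empty.

```shell
# Blank line before the matching line: swap in the (empty) hold space,
# print it, then swap the original line back.
before=$(printf 'one\ntwo\nthree\n' | sed '/two/{x;p;x;}')

# Blank line after the matching line: G appends a newline plus the
# (empty) hold space to the pattern space.
after=$(printf 'one\ntwo\nthree\n' | sed '/two/G')

printf '%s\n---\n%s\n' "$before" "$after"
```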

For ways to insert blank lines at other positions (such as the end of the file), see: sed之添加空行 - 冰灵儿 - 博客园

Practice log

Tidying the output of nmap -sP

Output of nmap -sP 192.168.56.0/24:

Starting Nmap 6.40 ( http://nmap.org ) at 2019-04-20 18:06 CST
Nmap scan report for host (192.168.56.100)
Host is up (0.00014s latency).
MAC Address: 0A:00:27:00:00:1C (Unknown)
Nmap scan report for master (192.168.56.11)
Host is up.
Nmap done: 256 IP addresses (2 hosts up) scanned in 3.27 seconds

Extract the IP addresses from this output:

sed -n 's/.*[^0-9]\(\([[:digit:]]\{1,\}\.\)\{3\}[[:digit:]]\{1,\}\).*/\1/p' nmap-sP-results.txt
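The same extraction may be easier to read with extended regular expressions. The sketch below assumes a sed that accepts -E (GNU sed also accepts -r, as listed in the help above) and runs against a few lines copied from the sample output.

```shell
# A few lines from the sample nmap -sP output above.
sample='Nmap scan report for host (192.168.56.100)
Host is up (0.00014s latency).
Nmap scan report for master (192.168.56.11)
Nmap done: 256 IP addresses (2 hosts up) scanned in 3.27 seconds'

# Three "digits." groups followed by digits, preceded by a non-digit.
ips=$(printf '%s\n' "$sample" | sed -n -E 's/.*[^0-9](([0-9]+\.){3}[0-9]+).*/\1/p')

printf '%s\n' "$ips"
```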

Tidying the output of nmap -sS -p22 -oG

Output of nmap -sS -p22 -oG a 202.117.0.0/16:

# Nmap 7.70 scan initiated Tue Dec  4 13:42:48 2018 as: nmap -sS -p22 -oG a 202.117.0.0/16
Host: 202.117.0.0 ()    Status: Up
Host: 202.117.0.0 ()    Ports: 22/filtered/tcp//ssh///
Host: 202.117.0.1 (0h1.xjtu.edu.cn) Status: Up
Host: 202.117.0.1 (0h1.xjtu.edu.cn) Ports: 22/filtered/tcp//ssh///
Host: 202.117.0.2 (0h2.xjtu.edu.cn) Status: Up
Host: 202.117.0.2 (0h2.xjtu.edu.cn) Ports: 22/filtered/tcp//ssh///
Host: 202.117.0.3 (0h3.xjtu.edu.cn) Status: Up
Host: 202.117.0.3 (0h3.xjtu.edu.cn) Ports: 22/filtered/tcp//ssh///
Host: 202.117.0.4 (0h4.xjtu.edu.cn) Status: Up
Host: 202.117.0.4 (0h4.xjtu.edu.cn) Ports: 22/filtered/tcp//ssh///
Host: 202.117.0.5 (voice-gw.xjtu.edu.cn)    Status: Up
Host: 202.117.0.5 (voice-gw.xjtu.edu.cn)    Ports: 22/filtered/tcp//ssh///
Host: 202.117.0.6 (0h6.xjtu.edu.cn) Status: Up
Host: 202.117.0.6 (0h6.xjtu.edu.cn) Ports: 22/filtered/tcp//ssh///
......

Extract information from this output:

# IPs of hosts that are up (the Status line appears once per host)
sed -ne '/Status: Up/{s/^Host: \([0-9.]\+\) .*/\1/p}' a

# IPs of hosts whose SSH port (22) is open
sed -ne '/22\/open\/tcp/{s/^Host: \([0-9.]\+\) (.*/\1/p}' a

# IP and domain name of hosts that have a domain name
sed -ne '/Status: Up/{s/^Host: \([0-9.]\+ ([^)]\+)\).*/\1/p}' a

# IP and domain name of hosts that have a domain name and an open SSH port
sed -ne '/22\/open\/tcp/{s/^Host: \([0-9.]\+ ([^)]\+)\).*/\1/p}' a
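To see what these extractions produce, the sketch below runs similar commands on a tiny sample in the same format. The addresses, the host name gw.example.org, and the open port are invented for illustration (the real scan above shows only filtered ports), and * is used instead of \+ so that the commands do not depend on GNU extensions.

```shell
# Sample data in nmap -oG style, invented for this sketch.
scan=$(mktemp)
cat > "$scan" <<'EOF'
# sample scan, not real results
Host: 10.0.0.1 ()    Status: Up
Host: 10.0.0.1 ()    Ports: 22/filtered/tcp//ssh///
Host: 10.0.0.2 (gw.example.org) Status: Up
Host: 10.0.0.2 (gw.example.org) Ports: 22/open/tcp//ssh///
EOF

# Hosts that are up: one Status line per host, so no duplicates.
up=$(sed -n '/Status: Up/s/^Host: \([0-9.]*\) .*/\1/p' "$scan")

# Hosts with port 22 open.
ssh_open=$(sed -n '/22\/open\/tcp/s/^Host: \([0-9.]*\) .*/\1/p' "$scan")

# Hosts that have a non-empty name in parentheses.
named=$(sed -n '/Status: Up/s/^Host: \([0-9.]* ([^)][^)]*)\).*/\1/p' "$scan")

printf 'up:\n%s\nssh:\n%s\nnamed:\n%s\n' "$up" "$ssh_open" "$named"
rm -f "$scan"
```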

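Handling abbreviations

While writing a blog post about assembly language (16位汇编程序设计, on 16-bit assembly programming) I ran into a large number of abbreviations. Typing them all out by hand was too tedious, so I downloaded an abbreviation list from the web, named it abbreviations.txt, used a simple Vim macro to collect every abbreviation in the post at the end of the file, and planned to look each one up in abbreviations.txt and substitute it to save work. The lookup-and-replace step turned out to stump me. Unwilling to give up, I relearned sed and confidently set out to do it with a sed script plus a shell script, only to fail, which is what this section records. Later I solved it easily with grep plus a shell script, and there are many other shell versions as well. They are all in: Bash使用笔记 - 处理缩略语

Problem statement

Use each line (word) of word.txt as a search key in abbreviations.txt. If matching lines are found, output them (as many as are found); if none is found, output *[$word]:

File word.txt: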

ABC
ADD
AF
AH
AL
AMD
ASCII
ASSUME
AT
AX
...

File abbreviations.txt:

*[#!]: Shebang
*[/.]: Slashdot
*[100B-FX]: 100BASE-FX
*[100B-TX]: 100BASE-TX
*[100B-T]: 100BASE-T
*[100BVG]: 100BASE-VG
*[10B-FB]: 10BASE-FB
*[10B-FL]: 10BASE-FL
*[10B-FP]: 10BASE-FP
*[10B-F]: 10BASE-F
...
*[ABCL]: Actor-Based Concurrent Language
...
*[AF]: Auxiliary carry Flag(辅助进位标志)
...

Desired output:

*[ABC]:
*[ADD]:
*[AF]: Anisotropic Filtering
*[AF]: Auxiliary carry Flag(辅助进位标志)
*[AH]: Accumulator register High(寄存器 A 高位)
*[AH]: Active Hub
*[AL]: Access List
*[AL]: Accumulator register Low(寄存器 A 低位)
*[AL]: Active Link
*[AMD]: Advanced Micro Devices
*[ASCII]: American Standard Code for Information Interchange
*[ASSUME]:
*[AT]: Access Time

Approaches tried

Approach 1: using the b command
while read line
do
	sed -n -e "/\[$line]/{p; b next;}; 1a \
		\*\[$line]: 
		;:next;" abbreviations.txt
done < "${1:-/dev/stdin}"

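At first I tried to imitate the method above and use the b command for the conditional. Only after failing many times did I realize that a sed script is not run once from top to bottom over the whole file. To process the text in a single pass, sed runs the script once for every input line, so effects appear in order of increasing line number. In the script above, 1a *[$line] fires while sed is processing line 1 (unless line 1 itself matches and b next skips it), long before any later line containing $line has been read. By the time a match is found and p; b next; runs, the b next no longer has anything useful to skip.

If the sed command above seems too complicated to follow, a simpler one illustrates the same issue: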

sed -n -e '/ABC/{p; q;}; 1a\
ABC
;' abbreviations.txt

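What I expected: as soon as ABC is found, print the current line and quit immediately (the q command); if it is never found, append a line ABC after line 1. What actually happens: even when ABC is found, ABC is still appended after line 1. The reason is that sed applies the whole script (whether its commands are separated by ; or by -e) to each line in turn, so 1a has already run on line 1 by the time a later line matches /ABC/.

So I tried this: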
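The behavior can be reproduced on a three-line input instead of abbreviations.txt. In this sketch the input is made up: in the first run ABC appears on line 3, so the append attached to line 1 has already happened when the match is found; in the second run ABC is on line 1, so q stops sed before 1a is reached.

```shell
# ABC is on line 3: "1a" fires during line 1, before the match is ever seen.
late=$(printf 'x\ny\nABC found\n' | sed -n -e '/ABC/{p;q;}' -e '1a\
ABC')

# ABC is on line 1: p and q run first, so "1a" is never executed.
early=$(printf 'ABC first\ny\n' | sed -n -e '/ABC/{p;q;}' -e '1a\
ABC')

printf '%s\n---\n%s\n' "$late" "$early"
```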

while read line
do
	sed -n -e "/\[$line]/{p; b next;}; \$a \
		\*\[$line]: 
		;:next;" abbreviations.txt
done < "${1:-/dev/stdin}"

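This did not solve my problem either, and I did not even know why it failed. It felt as if the b command were fake :sob: (The cause is the same as before: b next only skips the rest of the script for the line currently being processed, so $a still fires on the last line unless the last line itself matches.)

Approach 2: dropping the b command

This approach simplifies the final version of the previous one (dropping the b command that seemed to do nothing):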

while read line
do
sed -n -e "/\[$line]/{p;}; \$a \
\*\[$line]: 
;" abbreviations.txt
done < "${1:-/dev/stdin}"

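Naturally, this approach has the same problem as the final version of the previous one: for every word in word.txt, *[$word]: is still output even when matching lines are found in abbreviations.txt.

Approach 3: using the q command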
while read line
do
	sed -n -e "/\[$line]/{p; q;}; \$a \
		\*\[$line]: 
		;" abbreviations.txt
done < "${1:-/dev/stdin}"

This approach looks good but still has a problem: it can find only one line in abbreviations.txt matching each word (because of the q command).
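One possible way around this, offered as a sketch and not as the method used in the original posts, is to use the hold space as a "found" flag: each matching line is printed and copied to the hold space, and on the last line the fallback is printed only if the hold space is still empty. The helper name lookup and the sample file are hypothetical, the entries are abridged from the listings above, and the words are assumed to contain no characters that are special in a regular expression or in the replacement part of s.

```shell
# Sample data in the format of abbreviations.txt (abridged from the text above).
abbr=$(mktemp)
cat > "$abbr" <<'EOF'
*[ABCL]: Actor-Based Concurrent Language
*[AF]: Anisotropic Filtering
*[AF]: Auxiliary carry Flag
*[AH]: Active Hub
EOF

# lookup WORD FILE (hypothetical helper)
#   matching line: print it and copy it to the hold space (marks "found")
#   last line ($): swap in the hold space; if it is still empty, nothing was
#                  found, so turn the empty pattern space into *[WORD]: and print it
lookup() {
    sed -n \
        -e "/^\*\[$1]:/{p;h;}" \
        -e '${x;/^$/s/.*/*['"$1"']:/p;}' \
        "$2"
}

result=$(printf 'ABC\nAF\nAH\n' | while read -r word; do lookup "$word" "$abbr"; done)
printf '%s\n' "$result"
rm -f "$abbr"
```

Every matching line is printed, as the problem statement asks, and sed still reads the file only once per word. An empty abbreviations file produces no output at all, because sed then never sees a last line.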

Links

The links used in this post are summarized below: