Removing line breaks inside sequences in a FASTA file

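Some FASTA files wrap each sequence across several lines. The format allows this, but tools that expect every sequence on a single line can fail or give wrong results, so the line breaks inside the sequences need to be removed.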

For example, take a sequence file like this:

>LOX-3
MLGGLKDKLTGKNGNKIKGLAVLMSRKLLDPRDFTASLLDNVHE
VFGNSITCQLVSATVADQNNEGRGIVGSEANLEQGLTDLPSVSQGESKLTVRFNWEMD
KHGVPGAIIIKNHHSTKFFLKTITLHDVPGCDTIVFVANSWIYPVGKYHYNRIFFANI
SYPPSQMPEALRPYREDELRYLRGEDRQGPYQEHDRIYRYDVYNDLGEPDRDNPRPVL
GGSQKHPYPRRGRTGRIPTKKDPNSESRLSLLEQIY
>AOC
MAAAAPSRVSVRAAAPGQTGGFAKIRPQVVVAAAARSAGVSGRR
ARSVRASLFSPKPATPKDARPAKVQEMFVYEINERDRESPAYLRLSAKQTENALGDLV
PFTNKLYSGSLDKRLGISAGICILIQHVPERNGDRYEAIYSFYFGDYGHISVQGPYLT
YEESYLAVTGGSGVFEGAYGQVKLNQIVFPFKIFYTFYLKGIPDLPRELLCTPVPPSP
TVEPTPAAKATEPHACLNNFTN

This can be done with awk:

awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' test.fa > output.fa
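
The one-liner is dense, so here is the same logic spread over several lines with comments. This is only a sketch: the file names demo.fa and demo_unwrapped.fa and the two short records are made up for illustration and are not part of the original example.

```shell
# Build a tiny wrapped FASTA file (demo.fa and its records are made up for illustration).
printf '>seq1\nACGT\nTTGA\n>seq2\nGGCC\nAA\n' > demo.fa

# Same logic as the one-liner, one rule per line.
awk '
  # Sequence line: print it without a trailing newline and remember that a newline is pending.
  !/^>/ { printf "%s", $0; n = "\n" }
  # Header line: emit the pending newline (empty before the first header), then the header.
  /^>/  { print n $0; n = "" }
  # End of input: terminate the last sequence line.
  END   { printf "%s", n }
' demo.fa > demo_unwrapped.fa

cat demo_unwrapped.fa
```

The variable n holds the newline that is still owed after a run of sequence lines. It is empty before the first header, so no blank line is written at the top of the output.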

Inspect the output file; the line breaks inside the sequences are gone:

less -S output.fa
>LOX-3
MLGGLKDKLTGKNGNKIKGLAVLMSRKLLDPRDFTASLLDNVHEVFGNSITCQLVSATVADQNNEGRGIVGSEANLEQGLTDLPSVSQGESKLTVRFNWEMDKHGVPGAIIIKNHHSTKFFLKTITLHDVPGCDTIVFVANSWIYPVGKYHYNRIFFANISYPPSQMPEALRPYREDELRYLRGEDRQGPYQEHDRIYRYDVYNDLGEPDRDNPRPVLGGSQKHPYPRRGRTGRIPTKKDPNSESRLSLLEQIY
>AOC
MAAAAPSRVSVRAAAPGQTGGFAKIRPQVVVAAAARSAGVSGRRARSVRASLFSPKPATPKDARPAKVQEMFVYEINERDRESPAYLRLSAKQTENALGDLVPFTNKLYSGSLDKRLGISAGICILIQHVPERNGDRYEAIYSFYFGDYGHISVQGPYLTYEESYLAVTGGSGVFEGAYGQVKLNQIVFPFKIFYTFYLKGIPDLPRELLCTPVPPSPTVEPTPAAKATEPHACLNNFTN
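
For larger files, checking by eye is impractical. One possible sanity check is to confirm that the file has exactly two lines per record. The sketch below assumes there are no blank lines and no empty sequences; check_demo.fa is a made-up file used only for illustration.

```shell
# Made-up, already unwrapped file used only to demonstrate the check.
printf '>seq1\nACGTTTGA\n>seq2\nGGCCAA\n' > check_demo.fa

# Count header lines and total lines.
headers=$(grep -c '^>' check_demo.fa)
lines=$(awk 'END { print NR }' check_demo.fa)

# An unwrapped file has one header line plus one sequence line per record.
if [ "$lines" -eq $((headers * 2)) ]; then
  echo "unwrapped"
else
  echo "still wrapped"
fi
```

On this demo file the check prints "unwrapped". To test a real result, replace check_demo.fa with the file to be checked, for example output.fa.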

Reference:
https://stackoverflow.com/questions/15857088/remove-line-breaks-in-a-fasta-file