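Some non-standard FASTA files contain line breaks inside the sequences. These can interfere with analysis in some software, so they need to be removed.
For example, take the following sequence file: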
>LOX-3
MLGGLKDKLTGKNGNKIKGLAVLMSRKLLDPRDFTASLLDNVHE
VFGNSITCQLVSATVADQNNEGRGIVGSEANLEQGLTDLPSVSQGESKLTVRFNWEMD
KHGVPGAIIIKNHHSTKFFLKTITLHDVPGCDTIVFVANSWIYPVGKYHYNRIFFANI
SYPPSQMPEALRPYREDELRYLRGEDRQGPYQEHDRIYRYDVYNDLGEPDRDNPRPVL
GGSQKHPYPRRGRTGRIPTKKDPNSESRLSLLEQIY
>AOC
MAAAAPSRVSVRAAAPGQTGGFAKIRPQVVVAAAARSAGVSGRR
ARSVRASLFSPKPATPKDARPAKVQEMFVYEINERDRESPAYLRLSAKQTENALGDLV
PFTNKLYSGSLDKRLGISAGICILIQHVPERNGDRYEAIYSFYFGDYGHISVQGPYLT
YEESYLAVTGGSGVFEGAYGQVKLNQIVFPFKIFYTFYLKGIPDLPRELLCTPVPPSP
TVEPTPAAKATEPHACLNNFTN
This can be done with awk:
awk '!/^>/ { printf "%s", $0; n = "\n" }/^>/ { print n $0; n = "" }END { printf "%s", n }' test.fa > output.fa
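The one-liner is easier to follow when spread over several lines. The sketch below is the same awk program with comments, wrapped in a shell function. The function name unwrap_fasta is an illustrative choice and not part of the original command, and the sketch assumes the input has Unix line endings and no blank lines.

```shell
# unwrap_fasta is a hypothetical helper name.
# It reads FASTA from the files given as arguments, or from stdin if none are given.
unwrap_fasta() {
    awk '
        # Sequence line: print it without a trailing newline and remember
        # that a newline is owed before the next header.
        !/^>/ { printf "%s", $0; n = "\n" }
        # Header line: close the previous sequence (n is empty for the
        # first record), then print the header on a line by itself.
        /^>/  { print n $0; n = "" }
        # End of input: close the last sequence.
        END   { printf "%s", n }
    ' "$@"
}
```

It is called the same way as the one-liner, for example: unwrap_fasta test.fa > output.fa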
Check the output file; the line breaks inside the sequences have been removed:
less -S output.fa
>LOX-3
MLGGLKDKLTGKNGNKIKGLAVLMSRKLLDPRDFTASLLDNVHEVFGNSITCQLVSATVADQNNEGRGIVGSEANLEQGLTDLPSVSQGESKLTVRFNWEMDKHGVPGAIIIKNHHSTKFFLKTITLHDVPGCDTIVFVANSWIYPVGKYHYNRIFFANISYPPSQMPEALRPYREDELRYLRGEDRQGPYQEHDRIYRYDVYNDLGEPDRDNPRPVLGGSQKHPYPRRGRTGRIPTKKDPNSESRLSLLEQIY
>AOC
MAAAAPSRVSVRAAAPGQTGGFAKIRPQVVVAAAARSAGVSGRRARSVRASLFSPKPATPKDARPAKVQEMFVYEINERDRESPAYLRLSAKQTENALGDLVPFTNKLYSGSLDKRLGISAGICILIQHVPERNGDRYEAIYSFYFGDYGHISVQGPYLTYEESYLAVTGGSGVFEGAYGQVKLNQIVFPFKIFYTFYLKGIPDLPRELLCTPVPPSPTVEPTPAAKATEPHACLNNFTN
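For large files, inspecting the output by eye is impractical. A minimal sketch of an automated check follows, assuming every record has a non-empty sequence: the file should then contain exactly two lines per header. The function name check_unwrapped is hypothetical.

```shell
# check_unwrapped is a hypothetical helper name.
# It succeeds (exit status 0) when the line count is exactly twice the header count.
check_unwrapped() {
    awk '
        # Count header lines.
        /^>/ { headers++ }
        # NR is the total number of lines read; exit non-zero on a mismatch.
        END  { exit (NR != 2 * headers) }
    ' "$@"
}
```

For example: check_unwrapped output.fa && echo "one sequence line per record"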
Reference:
https://stackoverflow.com/questions/15857088/remove-line-breaks-in-a-fasta-file