How I come up with useful `𝚋𝚊𝚜𝚑` abominations, and how you can too

Fredrick Brennan
Jun 19, 2021


Behold, a useful program, which we will write and understand over the course of this article.

This program sorts the Japanese text in the file `lines`. Let’s consider the following input of character names from Neon Genesis Evangelion:

To come up with our final abomination, we must come up with some logical steps. So, what is sorting Japanese text?

  • figuring out how to pronounce the kanji;
  • listing the pronunciations in gojūon order (like alphabetical order for kana)
  • putting Latin at the end.¹

¹ This isn’t consistent; sometimes Latin is at the beginning. Let’s do it at the end though, since it’s more challenging. 😈

OK, first of all, kanji in Japanese have multiple pronunciations. Luckily, MeCab can return for us roughly what it thinks the pronunciation is based on context. It’s not perfect, but close enough for most lists, and far better than Unicode order.

The flag for this is -Oyomi. Building a Bash abomination is all about using small blocks, and testing everything before you smash blocks together.

So, let’s try two lyrics of popular songs, to make sure that 会 in 会う and in 会話 are properly differentiated:

$ mecab -Oyomi
会話が続かないな。何故だどうしてだ?アホか?
カイワガツヅカナイナナゼダドウシテダ?アホカ?
会いたかった 会いたかった 会いたかった yes!
アイタカッタ アイタカッタ アイタカッタ yes!

Good, much better than we can do on our own. So running it on the whole input……

We certainly notice that MeCab isn’t perfect, especially when it comes to the names of fictional characters, which often use strange 音読み (on’yomi, Sino-Japanese readings). However, as long as the first and second kana are correct, the order should be right in most cases for relatively short lists like this one. We see an example of an error in the name of 綾波レイ (Ayanami Rei), interpreted as アヤハレイ (Ayaha Rei).²
² 何❗❓

Next, let’s make it hiragana, since that is what’s most often used if we’re going to be generating section headers. The Network Kanji Filter (nkf) can do it. It’s not a hard problem.

$ nkf -h1 -w # h1=katakana→hiragana ; w=UTF-8
バイブル
ばいぶる

Before we move on, we’re at the point in every Bash abomination where we need to start thinking about our input streams and output streams.

Input: unordered Japanese mixed kanji–kana lines, one per line
Intermediary: input-ordered Japanese lines, katakana
Intermediary: input-ordered Japanese lines, hiragana
Intermediary: header kana
Output: header kana, followed by exact input lines, but in Unicode order based on the hiragana output

Clearly, order matters a lot here; it’s a sorting program, after all. We might want to start tagging our lines with numbers so we don’t lose track, and awk can do that. (In the awk snippets that follow, `$s` is just `$0`, the whole line: `s` is never set, so awk treats it as 0.) We need to build a pipeline, sending the output of one command as the input of the next. We need |.

$ mecab -Oyomi | nkf -h1 -w | awk '{print $s "\t" NR}'

会話が続かないな。何故だどうしてだ?アホか?
会いたかった 会いたかった 会いたかった yes!
バイブル
かいわがつづかないな。なぜだどうしてだ?あほか? 1
あいたかった あいたかった あいたかった yes! 2
ばいぶる 3

These numbers can help us figure out what lines of input we need post-ordering. But wait, why at the end? Of course, so sort knows we want kana Unicode order, not numerical order.
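In the spirit of testing small blocks first, the tag-then-sort idea can be tried on toy data without MeCab or nkf. This is only a sketch: the function name `tag_and_sort` and the romaji sample are made up for illustration, and `LC_ALL=C` is my assumption for getting plain byte order out of `sort` (which, for UTF-8 text, is the same as Unicode code point order).

```bash
#!/usr/bin/env bash
# Hypothetical helper: append each line's input position, then sort by the text.
# awk: $0 is the whole line, NR is its 1-based position in the input.
# sort: compare only the first tab-separated field, in plain byte order.
tag_and_sort() {
  awk '{ print $0 "\t" NR }' | LC_ALL=C sort -t $'\t' -k1,1
}

# Toy input: three names in "wrong" order.
printf '%s\n' ikari ayanami souryuu | tag_and_sort
# prints, tab-separated:
#   ayanami   2
#   ikari     1
#   souryuu   3
```

Because the number rides along as the last field, the sorted text still remembers which input line it came from.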

$ (mecab -Oyomi | nkf -h1 -w | awk '{print $s "\t" NR}' | sort) < lines2

BX293A PEN2 39
あいだけんすけ 30
あおばしげる 20
あかぎなおこ 37
あかぎりつこ 15
あがのかえで 55
あさりけいた 52
あすかのぎぼ 43
あすかのちち 42
あまぎひとみ 67
あやはれい 2
いかりげんどう 16
いかりしんじ 1
[…]

Note that the Latin is on top; we’ll have to remedy that later. Let’s figure out headers in the meantime.

The hard part is going to be gluing this back together, but at face value, just getting the headers is easy:

$ (mecab -Oyomi | nkf -h1 -w | awk '{print substr($s, 1, 1)}' | sort | uniq) < lines2

B










[…]
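One assumption hiding in this step is that `substr($s, 1, 1)` returns a whole kana and not just its first byte. As far as I know that holds for gawk in a UTF-8 locale, but a byte-oriented awk would cut the character apart and the headers would come out as garbage. A quick check, as a sketch (the function name `awk_kana_length` is made up):

```bash
#!/usr/bin/env bash
# Hypothetical check: ask awk how long one hiragana character is.
# 1 means awk counts characters; 3 means it counts UTF-8 bytes,
# in which case substr($0, 1, 1) would return only part of a kana.
awk_kana_length() {
  printf 'あ\n' | awk '{ print length($0) }'
}

awk_kana_length
```

If this prints 3, trying a different awk or a UTF-8 locale is probably the first thing to look at before trusting the header output.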

In order to do the glue, we have to think about this. Essentially, we want to repeat the first character of every line on its own line…

$ (mecab -Oyomi | nkf -h1 -w | awk '{ORS=""; print substr($s, 1, 1); ORS="\n"; print "\n" $s "\t" NR;}') < lines2

い
いかりしんじ 1
あ
あやはれい 2
そ
そうりゅう・あすか・らんぐれー 3
ま
まきなみ・まり・いらすとりあす 4
と
とうじょうまでのけいい 5
じ
じんぶつ 6
[…]

After sorting, of course…

$ (mecab -Oyomi | nkf -h1 -w | sort | awk '{ORS=""; print substr($s, 1, 1); ORS="\n"; print "\n" $s "\t" NR;}') < lines2

B
BX293A PEN2 1
あ
あいだけんすけ 2
あ
あおばしげる 3
あ
あかぎなおこ 4
あ
あかぎりつこ 5
あ
あがのかえで 6
[…]

But we want the header if and only if it hasn’t already appeared. So, we have to store the sorted output and run through it again to add B, あ, い, et cetera.

$ OUTP=`(mecab -Oyomi | nkf -h1 -w | awk '{ORS="\n"; print $s "\t" NR;}' | sort) < lines2`; echo "$OUTP" | awk '{ if (ff != substr($s, 1, 1)) { ff=substr($s, 1, 1); print ff "\n" $s } else { print $s }}'

B
BX293A PEN2 39
あ
あいだけんすけ 30
あおばしげる 20
あかぎなおこ 37
あかぎりつこ 15
あがのかえで 55
あさりけいた 52
あすかのぎぼ 43
あすかのちち 42
あまぎひとみ 67
あやはれい 2
い
いかりげんどう 16
いかりしんじ 1
いかりゆい 36
いぶきまや 18
いめーじ 11
[…]
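The header-on-change trick is easier to see when it is pulled out on its own. Here is a minimal sketch of the same idea, with a made-up function name (`add_headers`) and ASCII toy data so it behaves the same in any awk:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a one-character header whenever the first
# character of a (sorted) line differs from the previous line's.
add_headers() {
  awk '{
    first = substr($0, 1, 1)   # first character of the current line
    if (first != prev) {       # new group: emit the header once
      print first
      prev = first
    }
    print                      # always emit the line itself
  }'
}

printf '%s\n' aida aoba ikari | add_headers
# prints:
#   a
#   aida
#   aoba
#   i
#   ikari
```

It only works on input that is already sorted, which is exactly why the sorted output gets stored first.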

Seems like it’s time to get the original lines back. Let’s read the output into $OUT, so we can go over it one more time to move the Latin, which is totally optional, of course.

$ OUTP=`(mecab -Oyomi | nkf -h1 -w | awk '{ORS="\n"; print $s "\t" NR;}' | sort) < lines2`; OUT=$(echo "$OUTP" | awk -F "\t" '{ if (ff != substr($s, 1, 1)) { ff=substr($s, 1, 1); print ff }; { system("sed "$NF"q\\;d lines2") }}'); echo "$OUT"

B
BX293A PEN2
あ
相田ケンスケ
青葉シゲル
赤木ナオコ
赤木リツコ
阿賀野カエデ
浅利ケイタ
アスカの義母
アスカの父
天城ヒトミ
綾波レイ
い
碇ゲンドウ
碇シンジ
[…]
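Spawning one `sed` per line is part of the charm, but the same lookup can be sketched inside a single awk process by reading the original file into an array first. This is an alternative I am swapping in, not what the one-liner above does; `restore_lines` is a made-up name, and the sketch assumes the original file is not empty (otherwise the `NR == FNR` test misfires).

```bash
#!/usr/bin/env bash
# Hypothetical helper: replace "key<TAB>N" lines on stdin with line N of file $1.
restore_lines() {
  awk -F '\t' '
    NR == FNR { orig[FNR] = $0; next }  # first input (the file): remember every line
    { print orig[$NF] }                 # second input (stdin): look up by line number
  ' "$1" -
}

# Toy run: two original lines, and their sorted hiragana keys with line numbers.
tmp=$(mktemp)
printf '%s\n' '碇シンジ' '綾波レイ' > "$tmp"
printf 'あやはれい\t2\nいかりしんじ\t1\n' | restore_lines "$tmp"
rm -f "$tmp"
```

The trade-off is memory: the whole original file sits in an awk array, which is fine for a character list and less fine for a dictionary.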

Now let’s move the Latin.

$ # OUTP=output—preliminary; OUTI=output—intermediary; OUTF=output—final
$ FN=lines2; OUTP=`(mecab -Oyomi | nkf -h1 -w | awk '{ORS="\n"; print $s "\t" NR;}' | sort) < "$FN"`; OUTI=$(echo "$OUTP" | awk -F "\t" '{ if (ff != substr($s, 1, 1)) { ff=substr($s, 1, 1); print ff };{ system("sed "$NF"q\\;d '"$FN"'") }}');OUTF=$(egrep -v '^[A-Za-z]+' <<< "$OUTI" && egrep '^[A-Za-z]+' <<< "$OUTI"); echo "$OUTF"

あ
相田ケンスケ
青葉シゲル
赤木ナオコ
赤木リツコ
阿賀野カエデ
浅利ケイタ
アスカの義母
アスカの父
天城ヒトミ
綾波レイ
い
碇ゲンドウ
碇シンジ
碇ユイ
伊吹マヤ
イメージ
え
英語版での発音
お
大井サツキ
か
加賀ヒトミ
加古ナツコ
香椎エリカ
加持リョウジ
加持リョウジ
葛城ヒデアキ
葛城ミサト
[…]

涼波コトネ

老教師
B
BX293A PEN2
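One thing worth knowing about the Latin-moving step: `grep` exits with status 1 when it selects no lines, so joining the two greps with `&&` means the second one is skipped whenever the first finds nothing (for example, if every line happened to start with a Latin letter). A hedged sketch of the same split using plain sequencing instead; `latin_last` is a made-up name and the toy lines are mine:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print lines that do NOT start with a Latin letter first,
# then the ones that do.
latin_last() {
  local text
  text=$(cat)                      # buffer stdin so it can be scanned twice
  # The two greps run one after the other, not joined with "&&",
  # so the second runs even when the first selects nothing.
  grep -v '^[A-Za-z]' <<< "$text"
  grep    '^[A-Za-z]' <<< "$text"
  return 0                         # do not leak grep's "no match" status
}

printf '%s\n' 'BX293A PEN2' '相田ケンスケ' '碇シンジ' | latin_last
```

Dropping this into the one-liner would just mean swapping the `&&` for a `;`, which somehow makes it no less of an abomination.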

And that, my friends, is how you go from simple terminal commands to a totally incomprehensible Bash abomination. Also known as a Makefile.

Not even `bat` can save us.
