Friday, November 12, 2010

Generating a Tag Cloud the Unix Way

UPDATE: I've modified the script to add color and transparency :)

This tweet by Ushaft on creating a tag cloud caught my attention. While there are better tools available online, I decided to give it a try in Linux :)

First we need to generate a word-frequency list from a file. No-brainer. I used the GNU GPL license as the source text.

cat /usr/share/common-licenses/GPL |
tr 'A-Z' 'a-z' |
sed 's/[^a-z ]//g' |
sed 's/\s/\n/g' |
awk '{if (length($1)>=3) print $1 }' |
sort | uniq -c | sort -nr > freq.txt

Explanation (per line):
1. Print the license
2. Convert to lowercase
3. Remove punctuation (anything that is not a letter or a space)
4. Put one word per line
5. Discard words of 2 characters or fewer
6. Count each word, sort by frequency (highest first) and write to freq.txt
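
To sanity-check those steps without the full license text, here is a rough sketch of the same pipeline wrapped in a function and fed one sentence. The name word_freq is made up for this example, and I've swapped the two sed calls for tr, so it is a variant of the pipeline above rather than an exact copy.

```shell
#!/bin/sh
# word_freq: hypothetical helper name, not part of the original post.
# Reads text on stdin, prints "count word" lines, most frequent first.
word_freq() {
    tr 'A-Z' 'a-z' |               # lowercase everything
    tr -cd 'a-z \n' |              # delete all but letters, spaces, newlines
    tr -s ' ' '\n' |               # one word per line, squeezing blank lines
    awk 'length($1) >= 3' |        # keep words of 3+ characters
    sort | uniq -c |               # count identical words
    sort -k1,1nr -k2               # highest count first, ties alphabetical
}

# Quick check on a known sentence: "the" appears three times.
printf 'The cat and the dog. The END!\n' | word_freq
```

The first line of output should be the count 3 next to "the", followed by the four words that occur once.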

freq.txt needs some editing to remove irrelevant words like "the", "but", etc.
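
If you'd rather not edit the file by hand, something like the following could do the filtering. It is only a sketch: drop_stopwords and the stop-word file are names I made up, and the stop-word file is assumed to be non-empty with one word per line.

```shell
#!/bin/sh
# drop_stopwords STOPFILE: hypothetical helper, not from the original post.
# Reads "count word" lines on stdin and drops every line whose word
# is listed in STOPFILE (one word per line, assumed non-empty).
drop_stopwords() {
    awk 'NR == FNR { stop[$1]; next }   # first file: remember stop words
         !($2 in stop)                  # stdin: keep lines not in the list
    ' "$1" -
}
```

Usage would be along the lines of: drop_stopwords stopwords.txt < freq.txt > freq.clean.txt, and then feeding freq.clean.txt to the awk script instead of freq.txt.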

Now, to create the tag cloud, I wrote a simple awk script that outputs an SVG file. I saved it as gentagcloud.awk (please change the width, height and scale variables to your taste).

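#!/bin/awk -f

BEGIN {
    # Canvas size and font scale are illustrative values; change to taste
    WIDTH = 800; HEIGHT = 600; SCALE = 2
    OFS = ""    # stop print from putting spaces between its arguments
    srand()     # new positions and colors on every run
    print "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>"
    print "<svg width=\"", WIDTH, "\" height=\"", HEIGHT, "\" version=\"1.1\" xmlns=\"http://www.w3.org/2000/svg\">"
}

# Each line of freq.txt looks like "<count> <word>"
NF == 2 {
    # Random position on the canvas
    X = int(rand() * WIDTH)
    Y = int(rand() * HEIGHT)
    # Random shorthand hex color #RGB, each digit 0-8
    R = int(rand()*9)
    G = int(rand()*9)
    B = int(rand()*9)

    # Font size grows with the word's frequency
    print "<text style=\"fill:#", R, G, B, ";opacity:0.75;font-size:", $1 * SCALE, "px\" x=\"", X, "\" y=\"", Y, "\">", $2, "</text>"
}

END { print "</svg>" }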
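
The R, G and B lines draw three random digits from 0 to 8, which together form a shorthand hex color like #305. Here is a tiny standalone sketch of just that part; random_color is a made-up name, and it uses printf so nothing gets inserted between the digits.

```shell
#!/bin/sh
# random_color: hypothetical helper, not part of the original script.
# Prints one shorthand hex color "#RGB" with each digit in 0-8.
random_color() {
    awk 'BEGIN {
        srand()   # seeded from the clock, so calls within the same
                  # second are likely to print the same color
        printf "#%d%d%d\n", int(rand()*9), int(rand()*9), int(rand()*9)
    }'
}

random_color
```

Since the digits never go above 8, the colors stay on the darker side, which should read fine on a light background.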

Now, to create the tag cloud, run:

awk -f gentagcloud.awk < freq.txt > tagcloud.svg

Tada! The tag cloud is generated. View tagcloud.svg in your browser or image viewer.

If you have ImageMagick installed, you can convert it directly to PNG too. (Transparency isn't handled properly. If you have a workaround, please comment.)

awk -f gentagcloud.awk < freq.txt | convert - tagcloud.png

BTW, the tag cloud was like this: (lil' ugly, but WTH :))