Friday, November 12, 2010

Generating Tag Cloud the Unix way

UPDATE: I've modified the script to add color and transparency :)

This tweet by Ushaft on creating a tag cloud caught my attention. While there are better tools available online, I decided to give it a try in Linux :)

First we need to generate a frequency list of the words in a file. No-brainer. I used the GNU GPL license text as the source.

cat /usr/share/common-licenses/GPL |
tr 'A-Z' 'a-z' |
sed 's/[^a-z ]//g' |
sed 's/\s/\n/g' |
awk '{if (length($1)>=3) print $1 }' |
sort | uniq -c | sort -nr > freq.txt

Explanation (per line):
1. Print the license
2. Convert to lowercase
3. Remove punctuation (everything except letters and spaces)
4. Put one word per line
5. Discard words shorter than 3 characters
6. Count each word, sort by descending frequency and write to freq.txt
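
For systems without GNU sed (the \s and \n in step 4 are GNU extensions), here is a rough sketch of the same idea using only tr, awk, sort and uniq. The function name word_freq and the sample sentence are made up for illustration. Note one difference: punctuation splits words here instead of being deleted, so "don't" becomes "don" and "t" rather than "dont".

```shell
#!/bin/sh
# word_freq (hypothetical name): read text on stdin and print
# "<count> <word>" lines, most frequent word first.
word_freq() {
    tr 'A-Z' 'a-z' |           # convert to lowercase
    tr -cs 'a-z' '\n' |        # turn every run of non-letters into one newline
    awk 'length($0) >= 3' |    # keep words of 3 or more characters
    sort | uniq -c | sort -nr  # count duplicates, highest count first
}

# Made-up sample text standing in for the GPL file
printf 'The license, the License; a free LICENSE.\nFree software is free.\n' | word_freq
```

To use it on the real text, something like word_freq < /usr/share/common-licenses/GPL > freq.txt should do.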

freq.txt needs some editing to remove irrelevant words like "the", "but", etc.
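
Instead of editing by hand, the filtering could be scripted. This is only a sketch: the file names stopwords.txt, sample_freq.txt and freq.filtered.txt and the four stop words are my assumptions, so extend the list to taste.

```shell
#!/bin/sh
# Hypothetical stop-word list, one word per line
printf '%s\n' the and for but > stopwords.txt

# Stand-in for the real freq.txt, in the same "<count> <word>" layout
printf '%7d %s\n' 345 the 120 license 98 and 77 program > sample_freq.txt

# While reading the first file, remember every stop word.
# While reading the second, print only lines whose word is not on the list.
awk 'NR == FNR { stop[$1]; next } !($2 in stop)' \
    stopwords.txt sample_freq.txt > freq.filtered.txt
```

Point the awk command at the real freq.txt instead of sample_freq.txt and feed the filtered file to the next step.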

Now, to create the tag cloud, I wrote a simple awk script that outputs an SVG file. I saved it as gentagcloud.awk (please change the WIDTH, HEIGHT and SCALE variables to your taste).

#!/bin/awk -f

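BEGIN {
    WIDTH = 1000
    HEIGHT = 600
    SCALE = 1

    # Empty output separator, so print joins its arguments without spaces
    OFS = ""

    print "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>"
    print "<svg width=\"", WIDTH, "\" height=\"", HEIGHT,
          "\" version=\"1.1\" xmlns=\"http://www.w3.org/2000/svg\">"
}

# Each input line is "<count> <word>"
{
    # Three random digits (0-8) form a short hex colour such as #172
    R = int(rand()*9)
    G = int(rand()*9)
    B = int(rand()*9)

    # Font size follows the word count; the position is random
    print "<text style=\"fill:#", R, G, B, ";opacity:0.75;font-size:",
          $1*SCALE, "px;\" x=\"", rand()*(WIDTH-100),
          "\" y=\"", rand()*HEIGHT, "\">", $2, "</text>"
}

END { print "</svg>" }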
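
One way to pick SCALE rather than guessing: since the font size is count times SCALE, you can work backwards from the size you want for the most frequent word. The sketch below assumes a target of 72px (an arbitrary choice of mine) and uses a made-up sample file in place of freq.txt.

```shell
#!/bin/sh
# Stand-in for freq.txt; the real file is already sorted by descending count
printf '%7d %s\n' 345 the 120 license 98 and > sample_freq.txt

# Line 1 holds the largest count, so 72 / count gives the scale
# at which the biggest word is drawn at 72px (assumed target size).
SCALE=$(awk 'NR == 1 { printf "%.4f", 72 / $1 }' sample_freq.txt)
echo "SCALE=$SCALE"
```

The script sets SCALE inside its BEGIN block, so the computed value has to be pasted in there by hand.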

Then run it against the frequency list:

awk -f gentagcloud.awk < freq.txt > tagcloud.svg

Tada! Tag cloud generated. View tagcloud.svg in your browser or image viewer.

If you have ImageMagick installed, you can convert it directly to PNG too. (Transparency isn't handled properly. If you have a workaround, please comment.)

awk -f gentagcloud.awk < freq.txt | convert - tagcloud.png

BTW, the tag cloud was like this: (lil' ugly, but WTH :))

3 comments:

  1. Neat stuff, jwala!
    I like the simplicity of the script, and it's brilliant how it occurred to you. I was thinking of stupid solutions like gnuplot, some custom XML schema and blah blah, while plain old, elegant SVG never came to my mind :)

    Impressed.

    -Undershaft

  2. Good stuff, that is very little code to generate tag clouds.
    The best tool online, IMHO, is http://wordle.net

    --p

  3. Not a bad server, but it would be better if you had information about some of the books on algorithms, for example those by Knuth and Wirth.
