UPDATE: I've modified the script to add color and transparency :)
Ushaft's tweet on creating a tag cloud caught my attention. While there are better tools available online, I decided to give it a try in Linux :)
First we need to generate a frequency list of words from a file. No-brainer. I used the GNU GPL license text as the source.
cat /usr/share/common-licenses/GPL |
tr 'A-Z' 'a-z' |
sed 's/[^a-z ]//g' |
sed 's/\s/\n/g' |
awk '{if (length($1)>=3) print $1 }' |
sort | uniq -c | sort -nr > freq.txt
Explanation (per line):
1. Print the license
2. Convert to lowercase
3. Remove punctuation (anything that is not a lowercase letter or a space)
4. One word per line
5. Discard words of 2 letters or fewer
6. Count how often each word occurs, sort by count (highest first) and write to freq.txt
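To see what the pipeline above produces, here is a quick sketch that runs the same chain on a tiny made-up sentence instead of the GPL. The file names sample.txt and sample_freq.txt are just placeholders I picked, and I swapped the word-splitting sed for `tr -s ' ' '\n'`, which should do the same job without relying on GNU sed's `\s` and `\n`.

```shell
# Hypothetical sample text standing in for the GPL file
printf 'The cat saw the dog. The dog ran to it, and the cat hid!\n' > sample.txt

cat sample.txt |
tr 'A-Z' 'a-z' |                         # lowercase everything
sed 's/[^a-z ]//g' |                     # drop punctuation
tr -s ' ' '\n' |                         # one word per line (portable variant)
awk '{if (length($1)>=3) print $1 }' |   # "to" and "it" are dropped here
sort | uniq -c | sort -nr > sample_freq.txt

# Each line is "count word"; the first line is "the" with a count of 4
cat sample_freq.txt
```

The result has the same "count word" layout as freq.txt, which is what the awk script later reads as $1 and $2.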
freq.txt needs some editing to remove irrelevant words like "the", "but", etc.
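If editing by hand gets tedious, one possible shortcut is to keep the unwanted words in a separate list and let awk filter them out. This is only a sketch: stopwords.txt, freq_sample.txt and freq_sample_filtered.txt are names I made up, and the sample counts are invented.

```shell
# Hypothetical sample in the "count word" format that uniq -c produces
printf '%s\n' '  40 the' '  12 license' '   9 but' '   7 program' > freq_sample.txt

# Hypothetical stop-word list, one word per line
printf '%s\n' the but and for > stopwords.txt

# While reading the first file, remember every stop word.
# While reading the second file, keep only lines whose word ($2) is not one.
awk 'NR == FNR { stop[$1]; next } !($2 in stop)' \
    stopwords.txt freq_sample.txt > freq_sample_filtered.txt

cat freq_sample_filtered.txt
```

Only the "license" and "program" lines survive, with their counts and order untouched.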
Now, to create the tag cloud, I wrote a simple awk script that outputs an SVG file. I saved it as gentagcloud.awk (please change the WIDTH, HEIGHT and SCALE variables to your taste).
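#!/bin/awk -f
BEGIN {
    WIDTH = 1000    # canvas width in pixels
    HEIGHT = 600    # canvas height in pixels
    SCALE = 1       # font size in px = word count * SCALE
    OFS = ""        # print fields back to back, no separator
    print "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>"
    print "<svg width=\"", WIDTH, "\" height=\"", HEIGHT,
          "\" version=\"1.1\" xmlns=\"http://www.w3.org/2000/svg\">"
}
{
    # One random digit (0-8) per channel gives a short hex color like #375
    R = int(rand()*9)
    G = int(rand()*9)
    B = int(rand()*9)
    # $1 is the count, $2 is the word; the position is random
    print "<text style=\"fill:#", R, G, B, ";opacity:0.75;font-size:",
          $1*SCALE, "px;\" x=\"", rand()*(WIDTH-100),
          "\" y=\"", rand()*HEIGHT, "\">", $2, "</text>"
}
END { print "</svg>" }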
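Picking SCALE by trial and error works, but here is a rough sketch of one way to estimate it instead: since the font size is count * SCALE, dividing a target size for the biggest word by the top count gives a starting value. MAX_PX and freq_sample.txt are my own placeholders, and this assumes the file is already sorted with the highest count first.

```shell
# Hypothetical sample in freq.txt format, highest count first
printf '%s\n' ' 120 license' '  60 program' '  30 software' > freq_sample.txt

# MAX_PX is an assumed target font size (in px) for the most frequent word
MAX_PX=96

# The top count is on the first line, so SCALE = MAX_PX / top count
SCALE=$(awk -v max="$MAX_PX" 'NR == 1 { print max / $1; exit }' freq_sample.txt)
echo "SCALE=$SCALE"
```

With these sample numbers the result is 0.8, which you would then type into the SCALE line of gentagcloud.awk.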
To generate the tag cloud, run:
awk -f gentagcloud.awk < freq.txt > tagcloud.svg
Tada! Tag cloud generated. View tagcloud.svg in your browser or image viewer.
If you have ImageMagick installed, you can convert it directly to PNG too. (Transparency isn't handled properly. If you have a workaround, please comment.)
awk -f gentagcloud.awk < freq.txt | convert - tagcloud.png
BTW, the tag cloud was like this: (lil' ugly, but WTH :))
Neat stuff jwala !
I like the simplicity of the script, and it's brilliant how it occurred to you. I was thinking of stupid solutions like gnuplot, some custom XML schema and blah blah, while plain old, elegant SVG never came to my mind :)
Impressed.
-Undershaft
Good stuff, that is very little code to generate tag clouds.
The best tool online, IMHO, is http://wordle.net
--p
Not a bad site, but it would be better if you had information about some algorithm books, for example Knuth and Wirth.