UPDATE: I've modified the script to add color and transparency :)
Ushaft's tweet on creating a tag cloud caught my attention. While there are better tools available online, I decided to give it a try in Linux :)
First we need to generate a frequency list of words from a file. No-brainer. I used the text of the GNU GPL as the source.
cat /usr/share/common-licenses/GPL |
tr 'A-Z' 'a-z' |
sed 's/[^a-z ]//g' |
sed 's/\s/\n/g' |
awk '{if (length($1)>=3) print $1 }' |
sort | uniq -c | sort -nr > freq.txt
Explanation (per line):
1. Print the license
2. Convert to lowercase
3. Remove punctuation (anything that is not a letter or a space)
4. Put one word per line
5. Discard words of 2 letters or fewer
6. Count how often each word occurs, sort by count (highest first) and write to freq.txt
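To see the steps at work on something smaller than the GPL, here is a rough sketch that wraps the same pipeline in a function reading from stdin. The name word_freq is made up for this example, and I swapped the GNU-specific sed 's/\s/\n/g' for tr -s ' ' '\n', which should behave the same here because only letters and spaces are left at that point.

```shell
# word_freq: hypothetical wrapper around the pipeline above.
# Reads text on stdin, prints "count word" lines, most frequent first.
word_freq() {
    tr 'A-Z' 'a-z' |                        # convert to lowercase
    sed 's/[^a-z ]//g' |                    # drop everything but letters and spaces
    tr -s ' ' '\n' |                        # one word per line (portable variant)
    awk 'length($1) >= 3 { print $1 }' |    # keep words of 3+ letters
    sort | uniq -c | sort -nr               # count and sort by frequency
}

# Small sample: "the" occurs 3 times, "gpl" twice, "is" is too short to keep.
printf 'The GPL is free. The license is the GPL!\n' | word_freq
```

The first line of the output should be the count for "the", followed by "gpl"; the order of the words that occur only once depends on how your sort breaks ties.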
freq.txt needs some editing to remove irrelevant words like "the", "but", etc.
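If you would rather not edit freq.txt by hand, something along these lines might do the filtering. It is only a sketch: remove_stopwords and the stop word file are names I made up, and the stop word list (one word per line) is something you have to write yourself.

```shell
# remove_stopwords STOPFILE FREQFILE
# Hypothetical helper: prints the lines of FREQFILE ("count word" per line)
# whose word is not listed in STOPFILE (one word per line).
remove_stopwords() {
    awk -v stopfile="$1" '
        BEGIN {
            # Load the stop words into an array.
            while ((getline word < stopfile) > 0) stop[word] = 1
            close(stopfile)
        }
        # Keep only lines whose second field (the word) is not a stop word.
        !($2 in stop)
    ' "$2"
}
```

For example, after putting "the", "but" and friends into stopwords.txt, run remove_stopwords stopwords.txt freq.txt > freq.filtered.txt and feed the filtered file to the next step instead of freq.txt.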
Now, to create the tag cloud, I wrote a simple awk script that outputs an SVG file. I saved it as gentagcloud.awk (please change the WIDTH, HEIGHT and SCALE variables to your taste).
#!/bin/awk -f
BEGIN {
WIDTH=1000
HEIGHT=600
SCALE=1
OFS=""
print "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>"
print "<svg width=\"",WIDTH,"\" height=\"",
HEIGHT,"\" version=\"1.1\" ",
"xmlns=\"http://www.w3.org/2000/svg\">"
}
{
R = int(rand()*9)
G = int(rand()*9)
B = int(rand()*9)
# R, G and B are single digits (0-8), used together as a 3-digit hex color
print "<text style=\"fill:#",R,G,B,";opacity:0.75;",
"font-size:",$1*SCALE,"px;\" ",
"x=\"",rand()*(WIDTH-100),"\" ",
"y=\"",rand()*HEIGHT,"\">",$2,"</text>"
}
END{ print "</svg>" }
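Since the font size is simply count*SCALE, one way to pick SCALE is to decide how tall the most frequent word should be and divide. A small sketch, assuming freq.txt is still sorted with the highest count on the first line; suggest_scale is a made-up name, not part of the script above.

```shell
# suggest_scale FREQFILE MAX_PX
# Hypothetical helper: prints the SCALE at which the most frequent word
# (the first line of FREQFILE, which is sorted by count) is MAX_PX pixels tall.
suggest_scale() {
    awk -v max_px="$2" 'NR == 1 { print max_px / $1; exit }' "$1"
}
```

Running suggest_scale freq.txt 72 prints a number you can put into the SCALE variable in gentagcloud.awk, so the biggest word comes out at 72px.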
Now generate the tag cloud:
awk -f gentagcloud.awk < freq.txt > tagcloud.svg
Tada! Tag cloud generated. View tagcloud.svg in your browser or image viewer.
If you have ImageMagick installed, you can convert it directly to PNG too. (Transparency isn't handled properly. If you have a workaround, please comment.)
awk -f gentagcloud.awk < freq.txt | convert - tagcloud.png
BTW, the tag cloud was like this: (lil' ugly, but WTH :))