Zipf's law and natural languages ~ Robert Gawron

If we count how often each word appears in a sample of text written in (most) human languages, the counts follow Zipf's distribution: a few words are very common and most words are rare. This can be used to distinguish text in a human language (written by humans) from randomly generated text (for example, from spambots). This is shown in the plot below:

Below I present the tools I wrote to verify this. The first is a C++ program that parses a text and outputs the distribution of the words it encountered; the second is an R script that draws a plot from that distribution.

C++ parser:

#include <iostream>
#include <string>
#include <algorithm>
#include <cctype>   // ::tolower
#include <cstdlib>  // EXIT_SUCCESS, EXIT_FAILURE
#include <fstream>
#include <sstream>
#include <vector>
#include <map>
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>

using namespace std;
using namespace boost;

int main(int argc, char* argv[])
{
    if (2 != argc)
    {
        // wrong usage is an error: report it on stderr and return a failure code
        cerr << "usage: " << argv[0] << " filename" << endl;
        return EXIT_FAILURE;
    }

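    // read whole file into string
    ifstream input(argv[1]);
    if (!input)
    {
        cerr << "cannot open file: " << argv[1] << endl;
        return EXIT_FAILURE;
    }
    stringstream fileBuffer;
    fileBuffer << input.rdbuf();
    string text = fileBuffer.str();
    // make content of file lower case (in the default "C" locale this affects ASCII letters only)
    transform(text.begin(), text.end(), text.begin(), ::tolower);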

    // create map, where key = word, value = number of occurrences of this word in text
    char_separator<char> sep(" \t\n-;.,");
    tokenizer< char_separator<char> > tokens(text, sep);
    map<string, unsigned> words;
    typedef std::pair<string, unsigned> wordPairType;

    BOOST_FOREACH (const string& word, tokens)
    {
        // operator[] inserts a missing word with a count of 0, so one increment is enough
        ++words[word];
    }

    // create vector with the counts of all words in text
    vector<unsigned> distribution;
    BOOST_FOREACH (wordPairType t, words)
    {
        distribution.push_back(t.second);
    }

    // sort the counts in descending order (most frequent word first)
    sort(distribution.rbegin(), distribution.rend());

    // show results
    BOOST_FOREACH (unsigned i, distribution)
    {
        cout << i << endl;
    }

    return 0;
}

R script:

args <- commandArgs(TRUE)
sizes <- scan(args[1])

png(filename = "results.png", height = 500, width = 700, bg = "white")
plot(sizes, xlab = "word rank", ylab = "occurrences", type = "l", log = "xy")
dev.off()
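
The logarithmic axes requested in the plot call are what make the law easy to see. In its usual form, Zipf's law says that the count of a word is inversely proportional to a power of its rank. The symbols below are notation chosen here for illustration, not taken from the script: f(r) is the count of the r-th most frequent word, C is a constant, and s is an exponent close to 1.

```latex
f(r) = \frac{C}{r^{s}}, \qquad s \approx 1
\quad\Longrightarrow\quad
\log f(r) = \log C - s \log r
```

So on log-log axes, counts that follow Zipf's law should fall roughly on a straight line with slope -s.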

Assuming that the program was compiled to ./a.out, the R script was saved as chart.r, and the sample text was named test4.txt, running the command below should save the plot to results.png.

bash-3.2$ ./a.out test4.txt > results.txt && Rscript chart.r results.txt
Read 314 items
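
Besides looking at the plot, the sorted counts can be summarized by a single number: the slope of the line on log-log axes. The following is only a rough sketch under stated assumptions. fitZipfExponent is a hypothetical helper, not part of the program above, and a plain least-squares fit on log-log data is a crude estimate rather than a rigorous test. It expects the counts in the same descending order that the parser prints.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the original program): estimates the
// Zipf exponent s from word counts sorted in descending order, by fitting
// log(count) = a - s * log(rank) with ordinary least squares.
double fitZipfExponent(const std::vector<unsigned>& counts)
{
    const std::size_t n = counts.size();
    if (n < 2)
    {
        return 0.0; // not enough points to fit a line
    }

    double sumX = 0.0, sumY = 0.0, sumXX = 0.0, sumXY = 0.0;
    for (std::size_t i = 0; i < n; ++i)
    {
        assert(counts[i] > 0); // every counted word occurs at least once
        const double x = std::log(static_cast<double>(i + 1));     // log(rank)
        const double y = std::log(static_cast<double>(counts[i])); // log(count)
        sumX += x;
        sumY += y;
        sumXX += x * x;
        sumXY += x * y;
    }

    // standard least-squares slope; ranks are distinct, so denom is positive
    const double denom = n * sumXX - sumX * sumX;
    const double slope = (n * sumXY - sumX * sumY) / denom;
    return -slope; // the exponent is the negated slope
}
```

For counts that follow the law exactly, such as 1200, 600, 400, 300, this returns 1 (up to rounding); for a text in which every word is equally frequent it returns 0. A value far from 1 on a real sample would be a hint, not proof, that the text does not look like natural language.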
