Domain-Based Common Words

Week 7

This week we use the LearningTrees bundle (Random Forests) for classification after eliminating a set of domain-based common words.

We use the CBOW sentence vectors as input features to the Random Forest, with about 20% of the data reserved for testing.
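A minimal ECL sketch of those two preparation steps is shown below; the record layouts, dataset names and sample values are illustrative assumptions rather than the project's actual code:

  // Sketch only: layouts and sample values are assumptions for illustration.
  TokenRec := RECORD
    UNSIGNED docId;
    STRING   word;
  END;

  // Tokenised sentences and the domain-based common words to eliminate.
  tokens := DATASET([{1, 'signal'}, {1, 'dropped'}, {2, 'signal'}, {2, 'weak'}],
                    TokenRec);
  commonWords := DATASET([{'signal'}], {STRING word});

  // Drop every token that appears in the common-word list before the
  // CBOW sentence vectors are built.
  filteredTokens := JOIN(tokens, commonWords,
                         LEFT.word = RIGHT.word,
                         TRANSFORM(TokenRec, SELF := LEFT),
                         LEFT ONLY);

  // One record per sentence: an id, its CBOW features and its class label
  // (only two features here to keep the sketch short).
  VecRec := RECORD
    UNSIGNED id;
    REAL8    f1;
    REAL8    f2;
    UNSIGNED label;
  END;
  sentenceVectors := DATASET([{1, 0.12, -0.30, 1}, {2, -0.05, 0.44, 0}], VecRec);

  // Reserve roughly 20% of the records for testing via a hash of the id.
  trainRecs := sentenceVectors(HASH32(id) % 5 != 0);
  testRecs  := sentenceVectors(HASH32(id) % 5  = 0);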

We convert our data (both the training and test sets) to the form used by the ML bundles, then separate the independent variables from the dependent variable. Classification expects the dependent variable to be unsigned integers representing discrete class labels, so it is expressed using the DiscreteField layout.
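In ECL that looks roughly like the sketch below, continuing from the datasets above. ToField and the DiscreteField layout come from the ML_Core bundle, while the label column number is an assumption about where the class label lands after conversion:

  IMPORT ML_Core;
  IMPORT ML_Core.Types AS Types;

  // Convert the training and test data to the NumericField form used by
  // the ML bundles. ToField is a macro that defines the output dataset.
  ML_Core.ToField(trainRecs, trainNF);
  ML_Core.ToField(testRecs,  testNF);

  labelCol := 3;   // column holding the class label after ToField (assumed)

  // Independent variables: every column except the label.
  X_train := trainNF(number < labelCol);
  X_test  := testNF(number < labelCol);

  // Dependent variable: the label column, projected onto DiscreteField
  // because classification expects unsigned-integer class labels.
  Y_train := PROJECT(trainNF(number = labelCol), Types.DiscreteField);
  Y_test  := PROJECT(testNF(number = labelCol), Types.DiscreteField);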

We then get the assessmentC result, which contains a Classification_Accuracy record. That record holds the following metrics (a sketch of how it is produced follows the list):

  • RecCount – The number of records tested
  • ErrCount – The number of misclassifications
  • Raw_Accuracy – The percentage of records that were correctly classified
  • PoD (Power of Discrimination) – How much better the classification is than choosing one of the classes at random
  • PoDE (Extended Power of Discrimination) – How much better the classification is than always choosing the most common class
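Putting it together, a sketch of the training, classification and assessment steps (names follow the sketches above; the forest is left at its default parameters):

  IMPORT ML_Core;
  IMPORT LearningTrees AS LT;

  // Train a Random Forest on the CBOW sentence vectors.
  myLearnerC := LT.ClassificationForest();
  modelC     := myLearnerC.GetModel(X_train, Y_train);

  // Classify the roughly 20% of records held back for testing.
  predictedC := myLearnerC.Classify(modelC, X_test);

  // assessmentC holds the Classification_Accuracy metrics listed above
  // (RecCount, ErrCount, Raw_Accuracy, PoD, PoDE).
  assessmentC := ML_Core.Analysis.Classification.Accuracy(predictedC, Y_test);
  OUTPUT(assessmentC);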

We were able to increase the accuracy by 3% after eliminating 5,000 domain-based common words.


