Domain Based Common words

Week 10

This week we wrote Python code to test the domain-based methodology using a naive Bayes classifier, and installed the required Python packages on AWS instances. The accuracy of the classifier before eliminating any words was 0.497. After eliminating 10000 words the accuracy improved to 0.515, meaning we improved the accuracy of the classifier by roughly 2 percentage points while also reducing the number of features it uses.
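
The code itself is not reproduced in this report; the following is a minimal sketch of the week's experiment, assuming scikit-learn and using simple corpus frequency as a stand-in for how the domain-based common words are selected (the function and variable names are illustrative, not the project's actual script).

```python
# Sketch of the week-10 experiment: train a naive Bayes classifier, then
# retrain after removing the N most frequent words, and compare accuracies.
# Uses scikit-learn; the frequency-based word selection is an assumption,
# standing in for the domain-based common-word list used in the project.
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score


def nb_accuracy(docs, labels, stop_words=None):
    """Train MultinomialNB on bag-of-words counts and return test accuracy."""
    vectorizer = CountVectorizer(stop_words=stop_words)
    X = vectorizer.fit_transform(docs)
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
    model = MultinomialNB().fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


def most_common_words(docs, n):
    """Return the n most frequent tokens in the corpus (a stand-in for the
    domain-based common words eliminated in the report)."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    return [w for w, _ in counts.most_common(n)]


# docs, labels = ...  (the documents and their class labels)
# base    = nb_accuracy(docs, labels)
# reduced = nb_accuracy(docs, labels, stop_words=most_common_words(docs, 10000))
# print(base, reduced)   # e.g. 0.497 -> 0.515 as reported above
```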

Domain Based Common words

Week 9

  1. This week we increased the size of the data set: we tested the domain-based methodology on 92349 documents. We applied ClassificationForest to these documents and measured the accuracy of the classifier before and after eliminating 1000 words from the domain-based common words.
  2. We built the ground truth for the classification method by computing precision, recall and the F1 measure (see the sketch after this list).
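
One way such a ground-truth report could be computed from the predicted and actual class labels is sketched below, using scikit-learn; the project's own evaluation code may differ, and the variable names are placeholders.

```python
# Sketch of the ground-truth metrics mentioned above (precision, recall, F1),
# computed from predicted vs. actual class labels with scikit-learn.
from sklearn.metrics import precision_recall_fscore_support, accuracy_score


def ground_truth_report(y_true, y_pred):
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Compare the runs before and after eliminating the 1000 common words:
# report_before = ground_truth_report(y_test, preds_before)
# report_after  = ground_truth_report(y_test, preds_after)
```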

Domain Based Common words

Week 8

This week we used LearningTrees (Random Forests) for classification after eliminating a set of domain-based common words, and built the ground truth for the classifier before and after eliminating the domain-based common words.

We used the sentence vectors produced by CBOW as input features to the Random Forests, with about 20% of the data reserved for testing.

We converted our data (training and test sets) to the form used by the ML bundles, then separated the Independent Variables from the Dependent Variables. Classification expects Dependent Variables to be unsigned integers representing discrete class labels, so they are expressed using the DiscreteField layout.
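
The conversion itself is done in ECL with the ML bundle layouts; the sketch below is only a Python analogue of the same idea, with placeholder names: numeric independent variables on one side and unsigned-integer class labels (the role DiscreteField plays) on the other.

```python
# Python analogue of the conversion step described above: independent
# variables as a numeric matrix (here the CBOW sentence vectors) and the
# dependent variable encoded as unsigned integers, mirroring the discrete
# class labels the ML bundles expect. Not the project's ECL code.
import numpy as np


def encode_labels(class_names):
    """Map each distinct class name to an unsigned-integer label 1..K."""
    label_map = {name: i + 1 for i, name in enumerate(sorted(set(class_names)))}
    labels = np.asarray([label_map[name] for name in class_names], dtype=np.uint32)
    return labels, label_map

# X_train = np.asarray(train_vectors, dtype=np.float64)   # independent variables (CBOW vectors)
# y_train, label_map = encode_labels(train_classes)       # dependent variable (discrete labels)
# X_test  = np.asarray(test_vectors, dtype=np.float64)
# y_test  = np.asarray([label_map[c] for c in test_classes], dtype=np.uint32)
```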

We were able to increase the accuracy by 3% after eliminating 5000 words.

Domain Based Common words

Week 7

This week we used LearningTrees (Random Forests) for classification after eliminating a set of domain-based common words.

We used the sentence vectors produced by CBOW as input features to the Random Forests, with about 20% of the data reserved for testing.

We converted our data (training and test sets) to the form used by the ML bundles, then separated the Independent Variables from the Dependent Variables. Classification expects Dependent Variables to be unsigned integers representing discrete class labels, so they are expressed using the DiscreteField layout.

We then obtained the assessmentC record, which contains a Classification_Accuracy record with the following metrics (a rough sketch of how they can be computed follows the list):

  • RecCount – The number of records tested
  • ErrCount – The number of misclassifications
  • Raw_Accuracy – The percentage of records that were correctly classified
  • PoD (Power of Discrimination) – How much better the classification was than choosing one of the classes at random
  • PoDE (Extended Power of Discrimination) – How much better the classification was than always choosing the most common class
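
The sketch below shows one way metrics with these meanings could be computed from predicted and actual labels. It is a Python illustration, not the ML bundle's own implementation, and the PoD/PoDE formulas are assumptions based on the descriptions above (improvement relative to a random guesser and to a majority-class guesser).

```python
# Rough Python sketch of the Classification_Accuracy metrics listed above.
# The exact formulas inside the ML bundle may differ; PoD/PoDE are taken here
# as accuracy improvement over a uniform random guess and over always
# predicting the most common class, respectively.
from collections import Counter


def classification_accuracy(y_true, y_pred):
    rec_count = len(y_true)                                   # RecCount
    err_count = sum(t != p for t, p in zip(y_true, y_pred))   # ErrCount
    raw_accuracy = (rec_count - err_count) / rec_count        # Raw_Accuracy

    random_acc = 1.0 / len(set(y_true))                       # uniform random guesser
    majority_acc = Counter(y_true).most_common(1)[0][1] / rec_count  # modal-class guesser

    pod = (raw_accuracy - random_acc) / (1.0 - random_acc)        # PoD (assumed formula)
    pode = (raw_accuracy - majority_acc) / (1.0 - majority_acc)   # PoDE (assumed formula)
    return {"RecCount": rec_count, "ErrCount": err_count,
            "Raw_Accuracy": raw_accuracy, "PoD": pod, "PoDE": pode}

# example: classification_accuracy([1, 2, 2, 3], [1, 2, 3, 3])
```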

We were able to increase the accuracy by 3% after eliminating 5000 words.



Domain Based Common words

Week 6

This week we used LearningTrees (Random Forests) for classification. We chose Random Forests because they can handle large numbers of records and large numbers of fields, and because they scale well on HPCC Systems clusters of almost any size.

We used the sentence vectors produced by CBOW as input features to the Random Forests, with about 20% of the data reserved for testing.

We converted our data (training and test sets) to the form used by the ML bundles, then separated the Independent Variables from the Dependent Variables. Classification expects Dependent Variables to be unsigned integers representing discrete class labels, so they are expressed using the DiscreteField layout.
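
For illustration, the sketch below reproduces this workflow in Python with scikit-learn rather than the ECL LearningTrees bundle: sentence vectors as independent variables, integer class labels as the dependent variable, 20% of the data held out for testing, and accuracy measured on the held-out portion. Variable names and hyperparameters are placeholders, not the project's settings.

```python
# Python analogue (scikit-learn, not the ECL LearningTrees bundle) of the
# train / classify / assess flow described above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


def run_forest(X, y, n_trees=100):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X_tr, y_tr)              # roughly GetModel in the ECL bundle
    preds = forest.predict(X_te)        # roughly Classify
    return accuracy_score(y_te, preds)  # roughly the Raw_Accuracy assessment

# acc = run_forest(X, y)   # X: CBOW sentence vectors, y: unsigned-int class labels
```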

We then obtained the assessmentC record, which contains a Classification_Accuracy record with the following metrics:

  • RecCount – The number of records tested
  • ErrCount – The number of misclassifications
  • Raw_Accuracy – The percentage of records that were correctly classified

Domain Based Common words

Week 4

This week I applied the text-vectors (CBOW) method to the PubMed data sets. By representing each unique token in the corpus as a hundred-dimension vector, I was able to find the center of the words, itself a hundred-dimension vector. I then used Euclidean distance to find the distance from the center to every unique word in the corpus. Unfortunately I got stuck here: I did not get the expected results, so I wrote Python code to compare the results between ECL and Python. The Python results were what I expected, so we figured out that there was a problem in the ECL implementation of CBOW. I met with Kevin and Roger several times to work on the problem, and we were able to fix it.

source: https://towardsdatascience.com/an-implementation-guide-to-word2vec-using-numpy-and-google-sheets-13445eebd281
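
As a rough illustration of the center-and-distance computation described above (the Python side of the comparison; the names and structure are assumptions, not the project's actual script):

```python
# Minimal NumPy sketch of the week-4 computation: given hundred-dimension
# word vectors for every unique token, compute the centroid of all vectors
# and the Euclidean distance from that centroid to each word.
import numpy as np


def distances_from_center(word_vectors):
    """word_vectors: dict mapping each unique token to a 100-dim vector."""
    words = list(word_vectors)
    matrix = np.asarray([word_vectors[w] for w in words], dtype=np.float64)
    center = matrix.mean(axis=0)                       # 100-dim centroid
    dists = np.linalg.norm(matrix - center, axis=1)    # Euclidean distances
    return dict(zip(words, dists))

# This is the quantity that was compared between the ECL and Python runs
# to locate the problem in the ECL CBOW implementation.
```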