Domain Based Common words

Week 10

This week we wrote Python code to test the domain-based methodology using the naive Bayes classifier, after installing the required Python packages on AWS instances. The accuracy of the classifier before eliminating any words was 0.497. After eliminating 10,000 words the accuracy improved to 0.515, meaning we improved the classifier's accuracy by roughly 2 percentage points while also reducing the number of features.

Domain Based Common words

Week 9

  1. This week we increased the size of the dataset, testing the domain-based methodology on 92,349 documents. We applied the ClassificationForest to these documents and measured the accuracy of the classifier before and after eliminating 1,000 of the domain-based common words.
  2. We built the ground truth of the classification method by computing precision, recall and the F1 measure.
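
Below is a minimal ECL sketch of that ground-truth computation. It assumes the forest's predictions (predicted) and the true labels (Y_test) are already available as DiscreteField datasets, and that ML_Core's AccuracyByClass analysis exposes per-class precision and recall fields; F1 is then derived from those two. The names here are assumptions for illustration rather than the exact code we ran.

  IMPORT ML_Core;

  // Assumed: 'predicted' and 'Y_test' are DATASET(ML_Core.Types.DiscreteField)
  // holding the forest's class assignments and the ground-truth labels.
  // AccuracyByClass is assumed to return per-class precision and recall.
  byClass := ML_Core.Analysis.Classification.AccuracyByClass(predicted, Y_test);

  WithF1 := RECORD
    RECORDOF(byClass);
    REAL8 f1;
  END;

  // F1 = 2 * precision * recall / (precision + recall), guarded against 0/0.
  scored := PROJECT(byClass,
                    TRANSFORM(WithF1,
                              SELF.f1 := IF((LEFT.precision + LEFT.recall) = 0, 0,
                                            2 * LEFT.precision * LEFT.recall /
                                                (LEFT.precision + LEFT.recall)),
                              SELF := LEFT));
  OUTPUT(scored, NAMED('PrecisionRecallF1'));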

Domain Based Common words

Week 8

This week we used the LearningTrees bundle (Random Forests) for classification after eliminating a set of domain-based common words, and built the ground truth of the classifier before and after the elimination.
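
One way the elimination step could look in ECL is sketched below; the record layouts, the tiny sample data and the cutoff N are all illustrative assumptions, not the exact code we used.

  // 'tokens' stands in for the tokenized corpus; 'rankedCommon' stands in for
  // the domain-based common words ranked by distance from the corpus centre.
  TokenRec := {UNSIGNED8 docId; STRING word};
  WordRec  := {STRING word};

  tokens       := DATASET([{1, 'gene'}, {1, 'expression'}, {2, 'cell'}], TokenRec);
  rankedCommon := DATASET([{'cell'}, {'gene'}], WordRec);

  N := 5000;                          // number of common words to eliminate
  toDrop := CHOOSEN(rankedCommon, N);

  // Keep only the tokens whose word is not in the elimination list.
  filteredTokens := JOIN(tokens, toDrop,
                         LEFT.word = RIGHT.word,
                         TRANSFORM(LEFT),
                         LEFT ONLY);
  OUTPUT(filteredTokens, NAMED('FilteredTokens'));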

We used the sentence vectors produced by CBOW as input features to the Random Forests, with about 20% of the data reserved for testing.

We converted our data (both training and test data) to the form used by the ML bundles, then separated the independent variables from the dependent variables. Classification expects dependent variables to be unsigned integers representing discrete class labels, so they are expressed using the DiscreteField layout.
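
A minimal sketch of that conversion is shown below, using a tiny three-component layout to stand in for the hundred-dimension sentence vectors; the field names, the simple id-based holdout and the use of Discretize.ByRounding are assumptions for illustration.

  IMPORT ML_Core;

  // Illustrative layout: three components stand in for the 100 CBOW dimensions;
  // 'label' is the journal class. All names here are assumptions.
  VecRec := RECORD
    UNSIGNED8 id;
    REAL8 v1;
    REAL8 v2;
    REAL8 v3;
    UNSIGNED2 label;
  END;
  vectorised := DATASET([{1, 0.12, -0.40, 0.33, 1},
                         {2, -0.05, 0.27, 0.91, 2}], VecRec);

  // Simple id-based holdout: roughly 80% for training, 20% for testing.
  trainRaw := vectorised(id % 5 != 0);
  testRaw  := vectorised(id % 5  = 0);

  // ToField turns each record into NumericField cells: the first field is the
  // record id, the remaining fields become field numbers 1, 2, 3, ...
  ML_Core.ToField(trainRaw, trainNF);
  ML_Core.ToField(testRaw, testNF);

  // Independent variables: the vector components (field numbers 1..3 here).
  X_train := trainNF(number <= 3);
  X_test  := testNF(number <= 3);

  // Dependent variable: the class label (the last field number), converted to
  // the DiscreteField layout that classification expects.
  Y_train := ML_Core.Discretize.ByRounding(trainNF(number = 4));
  Y_test  := ML_Core.Discretize.ByRounding(testNF(number = 4));
  OUTPUT(Y_train, NAMED('Y_train'));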

We were able to increase the accuracy by 3% after eliminating 5000 words.

Domain Based Common words

Week 7

This week we used the LearningTrees bundle (Random Forests) for classification after eliminating a set of domain-based common words.

We used the sentence vectors produced by CBOW as input features to the Random Forests, with about 20% of the data reserved for testing.

We converted our data (both training and test data) to the form used by the ML bundles, then separated the independent variables from the dependent variables. Classification expects dependent variables to be unsigned integers representing discrete class labels, so they are expressed using the DiscreteField layout.

We get the assessmentC result, which contains a Classification_Accuracy record. That record holds the following metrics (a minimal sketch of this step follows the list):

  • RecCount – The number of records tested
  • ErrCount – The number of misclassifications
  • Raw_Accuracy – The percentage of records that were correctly classified
  • PoD (Power of Discrimination) – How much better was the classification than choosing one of the classes randomly?
  • PoDE (Extended Power of Discrimination)  – How much better was the classification than always choosing the most common class?
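
A minimal sketch of how that assessment is obtained is shown below. It assumes the X_train, Y_train, X_test and Y_test definitions from the conversion sketch above, and uses the LearningTrees ClassificationForest with its default hyperparameters; treat it as an outline rather than the exact code we ran.

  IMPORT ML_Core;
  IMPORT LearningTrees AS LT;

  // Assumes X_train, Y_train, X_test and Y_test from the conversion sketch above.
  myLearner := LT.ClassificationForest();          // default hyperparameters
  model     := myLearner.GetModel(X_train, Y_train);
  predicted := myLearner.Classify(model, X_test);

  // Compares the predictions against the ground-truth labels and returns the
  // metrics listed above: RecCount, ErrCount, Raw_Accuracy, PoD and PoDE.
  assessmentC := ML_Core.Analysis.Classification.Accuracy(predicted, Y_test);
  OUTPUT(assessmentC, NAMED('assessmentC'));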

We were able to increase the accuracy by 3% after eliminating 5000 words.

Domain Based Common words

Week 6

This week we used the LearningTrees bundle (Random Forests) for classification. We chose Random Forests because it can handle large numbers of records and fields, and it scales well on HPCC Systems clusters of almost any size.

We used the sentence vectors produced by CBOW as input features to the Random Forests, with about 20% of the data reserved for testing.

We converted our data (both training and test data) to the form used by the ML bundles, then separated the independent variables from the dependent variables. Classification expects dependent variables to be unsigned integers representing discrete class labels, so they are expressed using the DiscreteField layout.

We get the assessmentC result, which contains a Classification_Accuracy record. That record holds the following metrics:

  • RecCount – The number of records tested
  • ErrCount – The number of misclassifications
  • Raw_Accuracy – The percentage of records that were correctly classified

Domain Based Common words

Week 4

This week I applied the text vectors (CBOW) method to the PubMed datasets. By representing each unique token in the corpus with a hundred-dimension vector, I was able to find the center of all the word vectors, and I used Euclidean distance to measure the distance from that center to every unique word in the corpus. Unfortunately, I got stuck here: I did not get the expected results, so I wrote Python code to compare the results between ECL and Python. In Python I got what I expected, so we concluded that there was a problem in the ECL application of CBOW. I met with Kevin and Roger several times to track down the problem, and we were able to fix it.
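
A minimal ECL sketch of the centre-and-distance step is shown below, using tiny three-dimension vectors in place of the real hundred-dimension ones; the record layout and sample values are assumptions for illustration.

  // Word vectors as produced by the CBOW step (layout assumed).
  WordVec := RECORD
    STRING       word;
    SET OF REAL8 vec;
  END;

  dims := 3;   // three dimensions stand in for the 100 used in the project
  wordVecs := DATASET([{'gene',  [0.10, -0.30,  0.25]},
                       {'cell',  [0.05, -0.20,  0.40]},
                       {'trial', [0.90,  0.10, -0.60]}], WordVec);

  // One row per (word, dimension, value).
  CompRec := {STRING word; UNSIGNED2 dim; REAL8 val};
  components := NORMALIZE(wordVecs, dims,
                          TRANSFORM(CompRec,
                                    SELF.word := LEFT.word,
                                    SELF.dim  := COUNTER,
                                    SELF.val  := LEFT.vec[COUNTER]));

  // Centre: the per-dimension mean over every word in the vocabulary.
  centre := TABLE(components, {dim; REAL8 mean := AVE(GROUP, val)}, dim);

  // Squared differences per dimension, then Euclidean distance per word.
  sqDiff := JOIN(components, centre, LEFT.dim = RIGHT.dim,
                 TRANSFORM({STRING word; REAL8 sq},
                           SELF.word := LEFT.word,
                           SELF.sq   := POWER(LEFT.val - RIGHT.mean, 2)));
  distances := TABLE(sqDiff, {word; REAL8 dist := SQRT(SUM(GROUP, sq))}, word);

  // Words ranked by distance from the centre.
  OUTPUT(SORT(distances, dist), NAMED('WordDistances'));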

source: https://towardsdatascience.com/an-implementation-guide-to-word2vec-using-numpy-and-google-sheets-13445eebd281

Domain Based Common words

Week 3

The idea of our project is to find the domain-based common words based on the high-dimensional representation of the words. This week I ran the continuous bag of words (CBOW) bundle that exists in ECL on a sample dataset to extract the vector representation of the words in the corpus. I was able to extract the vocabulary from the corpus, extract the high-dimensional representation of each word in the corpus, and gather some information about the corpus (vocabulary size and number of occurrences of each term).

source: https://towardsdatascience.com/an-implementation-guide-to-word2vec-using-numpy-and-google-sheets-13445eebd281

Domain Based Common words

Week 2

This week I started preparing the dataset that I will use in my internship. I collected and downloaded the PubMed datasets from https://catalog.data.gov/dataset/pubmed. PubMed has more than 26 million citations for biomedical literature from MEDLINE, life science journals, and online books.

To use this dataset as input to the classification methods, I picked some journals to be used in testing the domain-based methodology. It is also important to clean the dataset, so I wrote ECL code to do that: I read the data in ECL and converted the documents to records.
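
A minimal sketch of that reading and cleaning step is shown below; the logical file name, the tab-separated layout and the field names are assumptions for illustration.

  IMPORT Std;

  DocRec := RECORD
    UNSIGNED8 docId;
    STRING    journal;
    STRING    text;
  END;

  // Assumed logical file name and tab-separated layout for the raw export.
  rawDocs := DATASET('~pubmed::raw::sample', DocRec, CSV(SEPARATOR('\t'), HEADING(1)));

  // Basic cleaning: lower-case the text and keep only letters, digits and spaces.
  cleanDocs := PROJECT(rawDocs,
                       TRANSFORM(DocRec,
                                 SELF.text := Std.Str.Filter(
                                     Std.Str.ToLowerCase(LEFT.text),
                                     'abcdefghijklmnopqrstuvwxyz0123456789 '),
                                 SELF := LEFT));
  OUTPUT(cleanDocs, NAMED('CleanDocs'));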

Domain Based Common words

Week 1

Our goal is to use the text vectors bundle (CBOW) in HPCC to find the common words for any dataset. The idea behind using text vectors is its ability to map each unique token in the corpus to a vector and to discover the relationships between words by analyzing word usage patterns. Text vectors map words into a high-dimensional vector space such that similar words are grouped together, and the distances between words can reveal relationships. In the first week of my internship, I reviewed the ECL language, ran the HPCC platform on the virtual machine, and ran different examples in ECL.