The idea of our project is to find domain-based common words using the high-dimensional representation of the words. This week I ran the continuous bag-of-words (CBOW) bundle that exists in ECL on a sample dataset to extract the vector representation of the words in the corpus. I was able to extract the vocabulary from the corpus, extract the high-dimensional representation of each word in the corpus, and find some information about the corpus (vocabulary size, number of occurrences of each term).
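As a rough illustration of the corpus statistics mentioned above (vocabulary size and occurrences per term), here is a minimal Python sketch. It is not the ECL bundle itself: the toy corpus and the whitespace tokenization are assumptions made only for this example.

```python
from collections import Counter

# Hypothetical toy corpus; the real input would be the sample dataset.
corpus = [
    "the patient received the treatment",
    "the treatment reduced symptoms",
]

# Assumption: lowercase + whitespace tokenization, chosen for simplicity.
tokens = [tok for doc in corpus for tok in doc.lower().split()]

# Number of occurrences of each term in the corpus.
term_counts = Counter(tokens)

# Vocabulary size is the number of distinct terms.
vocab_size = len(term_counts)

print(vocab_size)
print(term_counts.most_common(2))
```

The same two quantities are what I read off the corpus after running the bundle; the sketch only shows what they mean.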
This week I also started preparing the dataset that I will use in my internship. I collected and downloaded the PubMed dataset from https://catalog.data.gov/dataset/pubmed. It has more than 26 million citations for biomedical literature from MEDLINE, life science journals, and online books.
To use this dataset as a source of features for the classification methods, I picked some journals for testing the domain-based methodology. It is also important to clean the dataset, so I wrote ECL code to do that. I read the data into ECL and converted the documents to records.
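The cleaning step can be sketched as follows. My actual code is written in ECL; this Python version is only an illustration, and the specific rules (lowercasing, keeping letters only, collapsing whitespace) as well as the `Document` record layout are assumptions, not the exact rules I applied.

```python
import re
from typing import NamedTuple


class Document(NamedTuple):
    """Hypothetical record layout: one cleaned document per record."""
    doc_id: int
    text: str


def clean_text(text: str) -> str:
    """Lowercase, replace non-letter characters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def to_records(raw_docs):
    """Turn raw strings into cleaned Document records with sequential ids."""
    return [Document(i, clean_text(doc)) for i, doc in enumerate(raw_docs, start=1)]


records = to_records(["Aspirin (81 mg) reduces  risk!", "A 2nd   abstract."])
print(records)
```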
I am using the text vectors bundle (CBOW) in HPCC to find the common words for any dataset. The idea behind using text vectors is their ability to map each unique token in the corpus to a vector, which makes it possible to discover relationships between words by analyzing word usage patterns. Text vectors map words into a high-dimensional vector space such that similar words are grouped together, and the distances between words can reveal relationships. In the first week of my internship, I reviewed the ECL language, ran the HPCC platform on a virtual machine, and ran different examples in ECL.
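To make the point about distances between word vectors concrete, here is a small Python sketch that compares words by cosine similarity. The three-dimensional vectors are made up for illustration (real CBOW vectors have many more dimensions and are learned from the corpus), and cosine similarity is one common choice of measure, not necessarily the one the bundle uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors, for illustration only.
vectors = {
    "tumor": [0.9, 0.1, 0.0],
    "cancer": [0.8, 0.2, 0.1],
    "journal": [0.0, 0.1, 0.9],
}


def most_similar(word):
    """Return the other word whose vector is closest to `word` by cosine similarity."""
    return max(
        (w for w in vectors if w != word),
        key=lambda w: cosine_similarity(vectors[word], vectors[w]),
    )


print(most_similar("tumor"))
```

With these toy values, "tumor" and "cancer" point in nearly the same direction while "journal" does not, which is the kind of grouping the vector space is meant to expose.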