This week I started preparing the dataset that I will use in my internship. I collected and downloaded the PubMed dataset from https://catalog.data.gov/dataset/pubmed. It has more than 26 million citations for biomedical literature from MEDLINE, life science journals, and online books.
To use this dataset as a feature for the classification methods, I picked some journals to test the domain-based methodology. It is also important to clean the dataset, so I wrote ECL code to do that. I read the data in ECL and converted the documents to records.
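
To give a rough idea of what this kind of cleaning step looks like, here is a minimal sketch in Python rather than ECL. It is not my actual ECL code: the `Citation` layout, the field names (`pmid`, `journal`, `text`), and the cleaning rules (strip markup, lowercase, drop punctuation) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


# Hypothetical record layout; the real ECL record definition may differ.
@dataclass
class Citation:
    pmid: str
    journal: str
    text: str


_TAG = re.compile(r"<[^>]+>")          # inline markup such as <i>...</i>
_NON_WORD = re.compile(r"[^a-z0-9\s]")  # anything that is not a letter, digit, or space
_SPACES = re.compile(r"\s+")            # runs of whitespace


def clean_text(raw: str) -> str:
    """Strip markup, lowercase, drop punctuation, and collapse whitespace."""
    text = _TAG.sub(" ", raw)
    text = text.lower()
    text = _NON_WORD.sub(" ", text)
    return _SPACES.sub(" ", text).strip()


def to_records(docs, journals):
    """Keep only documents from the chosen journals and clean their text.

    `docs` is assumed to be an iterable of dicts with the keys
    'pmid', 'journal', and 'text'.
    """
    wanted = {j.lower() for j in journals}
    return [
        Citation(d["pmid"], d["journal"], clean_text(d["text"]))
        for d in docs
        if d["journal"].lower() in wanted
    ]
```

Calling `to_records(docs, ["Nature"])` on a list of such dicts would keep only the citations whose journal is "Nature" (case-insensitive) and return them with cleaned text. In ECL the same idea would roughly correspond to filtering a dataset and applying a transform to each record.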