Posts

Showing posts from January, 2021

Urdu News Classification

News classification is a popular task in NLP: identifying the category of a news item and figuring out whether it is fake or not. There is an Urdu news dataset available, extracted from the web, which has multiple classes and can be used for news classification and other purposes.

Preprocessing: The news dataset is spread across multiple Excel files. For the sake of classification, we need to convert it into a single CSV file. Here is how I did it:

    import glob

    import pandas as pd

    files = glob.glob("data/*.xlsx")
    df = pd.DataFrame()

    # If you want to use xlrd, version 1.2.0 is good to go (xlrd 2.x reads only .xls);
    # openpyxl has a lot of issues.
    for file in files:
        excel_file = pd.read_excel(
            file,
            index_col=None,
            na_values=["NA"],
            usecols=["category", "summery", "title"],
            engine="xlrd",
        )
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
        df = pd.concat([df, excel_file], ignore_index=True)

    df.drop_duplicates(inplace=True)

    # Use a single word for classification.
    df.category = df.category.str.replace("weird ne...
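The merge-and-deduplicate step can be tried without the Excel files. The following is a minimal sketch, not the post's exact code: the helper name `combine_frames` and the sample rows are hypothetical, and only the column names `category`, `summery`, and `title` come from the post. It uses `pd.concat`, which works on current pandas versions (`DataFrame.append` was removed in pandas 2.0).

```python
import pandas as pd

# Column names as they appear in the post (including the "summery" spelling).
COLUMNS = ["category", "summery", "title"]


def combine_frames(frames):
    """Stack per-file DataFrames into one and drop exact duplicate rows.

    Hypothetical helper for illustration; not part of the original post.
    """
    if not frames:
        return pd.DataFrame(columns=COLUMNS)
    combined = pd.concat(frames, ignore_index=True)
    return combined.drop_duplicates().reset_index(drop=True)


# Made-up in-memory stand-ins for two Excel files (not real dataset rows).
part_a = pd.DataFrame(
    {
        "category": ["sports", "business"],
        "summery": ["s1", "s2"],
        "title": ["t1", "t2"],
    }
)
part_b = pd.DataFrame(
    {
        "category": ["business", "science"],
        "summery": ["s2", "s3"],
        "title": ["t2", "t3"],
    }
)

merged = combine_frames([part_a, part_b])

# With no path, to_csv returns the CSV text; pass a filename to write a file.
csv_text = merged.to_csv(index=False)
```

Collecting the frames in a list and concatenating once avoids re-copying the growing DataFrame on every loop iteration, which matters when there are many input files.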