Back to DrivenData | Blog

How to aggregate in pandas


#1

Hi…

Any idea how to aggregate hhold and indiv data by “id” after merging the two files. I want to aggregate and use the most frequent values of the features.

Thanks.


#2

You can group the items by the id and then perform operations inside the groups by using the pandas groupby function.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html


#3

pd.merge
pd.concat
pd.join


#4

How would you do it in R?


#5

it’s not streightforward but you can look at dplyr helpfile at join, this is my code:

train_ind_a <- fread("./data/A_indiv_train.csv", stringsAsFactors = TRUE)
test_ind_a <- fread("./data/A_indiv_test.csv", stringsAsFactors = TRUE)

combi_a_indiv <- bind_rows(train_ind_a, test_ind_a)

combi_a_indiv <- subset( combi_a_indiv, select = -c(iid, poor,country ) )
combi_a_indiv <- combi_a_indiv[!duplicated(combi_a_indiv$id), ]

combi_a <- join(combi_a, combi_a_indiv, by=“id”, type=“inner”)