hadoop - SQL/HQL Count Total Not Working on Join -


i'm creating recommender using simple probability formulas. i'm working million song dataset large dataset (189m rows , 3 columns) hadoop cluster, using hive. here sample of initial dataset:

million_song - initial dataset
 user   song    play_count  c3fb2  soxoq   18  c3fb2  sozvc   1  6041e  sobhn   3  6041e  sobxt   5 

but first created table initial dataset, sample show here:

million_both - adds occurrence of 2 songs listened same user
 driver_song_id     also_song_id    play_count  soxoq              sobxt           1642  sobhn              sobxt           2168  sobxt              sozvc           1742 
distinct users (i want value displayed in rows in fifth column of final output)
 select count(distinct users) million_song;  # 139,738,054 
i'm trying add these 2 tables above together, , issue can't correct count_n_users, should same # each row in table. here code:
 select a.driver_song_id, a.also_song_id, a.play_count,   count(distinct b.user), sum(distinct b.user) count_n_users   (select driver_song_id, also_song_id, play_count million_both)   join (select user, song, play_count expp_team.million_song) b   on a.driver_song_id = b.song  group a.driver_song_id, a.also_song_id, a.play_count; 

the issue cannot correct count_n_users, of right it's overstating value.

this i'd output like:

 driver_song_id     also_song_id    play_count  c(b.user)  count_n_users  soxoq              sobxt           1643        463         139,738,054             sobhn              sobxt           2168        483         139,738,054  sobxt              sozvc           1742        725         139,738,054 

if query returning of correct output except column , column should have same value on rows of result set, , can returned running query counts distinct users million_song table can add cross join query grabs total count. avoids issue group by you're experiencing.

select a.driver_song_id,        a.also_song_id,        a.play_count,        count(distinct b.user),        c.count_n_users   (select driver_song_id,                also_song_id,                play_count           million_both)   join (select user,                song,                play_count           expp_team.million_song) b     on a.driver_song_id = b.song  cross join (select count(distinct users) count_n_users million_song) c  group a.driver_song_id,           a.also_song_id,           a.play_count; 

Comments

Popular posts from this blog

asp.net mvc - SSO between MVCForum and Umbraco7 -

Python Tkinter keyboard using bind -

ubuntu - Selenium Node Not Connecting to Hub, Not Opening Port -