I'm trying to find a way to create a unique identifier from three columns (user_id, universal_id and session_id). The column "expected_result" shows what this unique_id should be after processing the other three columns.

- Sometimes user_id is not available; in that case the other two columns should be used to create the unique id.
- When user_id doesn't have a match and universal_id has a match, the rows should be treated as different (separate unique ids).
- The "id" column is the order in which data is written into the database.
- If a new row shows up that matches any of the previous rows (with an already calculated unique id) on any of the three columns, the existing unique id should be assigned to the new row.

Here's a list of possible relationships between columns:

- user_id:universal_id = 1:N OR N:1 (if N:1 then each N needs a unique_id).

I'm looking for something in Python (or PySpark, because I may be running this on millions of rows) that can help me cluster this data (or whatever this process is called in data science). The idea is to create a map of universal_id:unique_id. If you know how this is done, please help, or at least point me to a subject that I should research to be able to do this. I have Snowflake and Databricks at my disposal.

Based on what you described, we can formulate an algorithm as follows, referring to the new ID as global_id.

1. Create a new global_id for every user_id that has ...
2. Propagate values for global_id to all rows with matching ...
3. Create a new global_id for every user_id that has a single occurrence (n=1).
4. Propagate values for global_id to all rows with matching universal_id, i.e. rows which don't match on user_id but match on universal_id. There is an arbitrary tie-break if multiple universal_ids match on one or more user_ids, where all matching universal_ids are assigned to the same user_id.
5. Create a new global_id for every universal_id which cannot be linked to a user_id but has multiple occurrences.
6. Propagate values for global_id to all rows with ...
7. Propagate existing values for global_id to all rows with matching session_id, i.e. rows which don't match on either user_id or universal_id but do match on session_id.
8. Create a new global_id for every session_id which cannot be linked to either a user_id or a universal_id but has ...

Here's my test dataset: `import pandas as pd` followed by `df = pd.DataFrame(data, columns=)`

Update: The algorithm now features the arbitrary tie-break when multiple user_ids match multiple universal_ids.

Update: Since you were concerned about the risk of duplicates when using fully randomly generated UUID4s, I coded you a little function which allows you to generate a UUID leveraging UUID1 and/or UUID4. I personally would not be worried about clashes of UUID4 values whatsoever, but it's up to you.
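The create-then-propagate idea behind global_id can be sketched in pandas roughly as follows. This is a minimal illustration, not the answer's original code: the function name `assign_global_ids`, the helper `new_id`, and the sample rows are hypothetical, and it assumes missing values are stored as `None`/`NaN`. It also simplifies the rules by giving every unmatched key a fresh ID regardless of how many times it occurs, and it resolves ties by taking the first row that carries a given key.

```python
import uuid

import pandas as pd


def new_id() -> str:
    # Hypothetical helper: a random UUID4 string serves as the new identifier.
    return str(uuid.uuid4())


def assign_global_ids(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # One global_id per distinct user_id; rows without a user_id stay empty.
    user_map = {u: new_id() for u in out["user_id"].dropna().unique()}
    out["global_id"] = out["user_id"].map(user_map).astype(object)

    # Fall back to universal_id first, then session_id.
    for key in ("universal_id", "session_id"):
        # Key values already tied to a global_id; keeping the first row per
        # key is the arbitrary tie-break when a key maps to several IDs.
        known = (
            out.dropna(subset=["global_id", key])
            .drop_duplicates(key)
            .set_index(key)["global_id"]
        )

        # Propagate existing IDs to rows that still lack one.
        missing = out["global_id"].isna()
        if missing.any():
            out.loc[missing, "global_id"] = out.loc[missing, key].map(known)

        # Create fresh IDs for key values that could not be linked.
        missing = out["global_id"].isna() & out[key].notna()
        if missing.any():
            fresh = {k: new_id() for k in out.loc[missing, key].unique()}
            out.loc[missing, "global_id"] = out.loc[missing, key].map(fresh)

    return out


# Illustrative rows only (not the dataset from the post).
df = pd.DataFrame(
    {
        "user_id": ["u1", "u1", None, "u2", None, None, None],
        "universal_id": ["a", "b", "a", "a", "c", None, None],
        "session_id": ["s1", "s2", "s3", "s4", "s5", "s5", "s9"],
    }
)
result = assign_global_ids(df)
print(result)
```

In this sample, the third row has no user_id and inherits the ID of `u1` through universal_id `a`, while the `u2` row keeps its own ID even though it shares that universal_id. For millions of rows the same joins would need to be expressed in PySpark or SQL; this sketch only shows the order of operations.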