This is purely off topic, not Dropbox-related at all; really it's a sync design question. (If this is against the rules, my apologies to the mods.)
I've been struggling with an idea to avoid saving duplicate DBRecords.
We know that record.recordID is the main identifier: it's created once in the record's lifetime and never changes. (Obviously, why would anyone change it?)
Each DBRecord has a text field, record["text"]. All incoming DBRecords are deposited into my local container, keyed by their recordID.
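To make the setup concrete, here's a minimal sketch of that local container, keyed by recordID. The names (`local_store`, `deposit`, the sample IDs) are my own illustration, not the actual SDK:

```python
# Hypothetical sketch of the local cache described above: records keyed
# by their Dropbox-assigned recordID, with the text stored alongside.
local_store = {}

def deposit(record_id, text):
    """Store an incoming record under its recordID (last write wins)."""
    local_store[record_id] = {"text": text}

deposit("rec_001", "buy milk")
deposit("rec_001", "buy milk and eggs")  # same recordID overwrites; no duplicate
```

Keyed this way, a record that arrives twice with the same recordID can't duplicate itself; the problem only appears when the same content comes back under a fresh recordID.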
At some point the User:
- Delinks Dropbox
- Links Dropbox
My understanding is that delinking deletes the whole cache (which is good), and a newly relinked DBDatastore will try to send all those records again.
It's here that I don't really know what the right move is, the one in line with the end user's expectations.
So this is what I planned.
Upon receiving incoming datastores and their records:
- Check whether I have any local data at all. If I don't, assume it's a fresh app and/or device installation: add all DBRecords and check for nothing.
If I do have some local data, the next move is either:
- Add all DBRecords as if they're new, OR
- Check for duplication and only add those that are completely new.
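The branches above could be sketched like this. This is my own illustration of the logic, assuming records are a simple mapping of recordID to text, with duplication checked by comparing text content (since relinked records may carry fresh recordIDs):

```python
def merge_incoming(local, incoming, dedupe=True):
    """Merge incoming records into the local store.

    local / incoming: dicts mapping recordID -> text.
    - Empty local store: fresh install, take everything as-is.
    - dedupe=False: add all incoming records as if they're new.
    - dedupe=True: skip incoming records whose text already exists
      locally (possibly under a different recordID), so only
      completely new content is added.
    """
    if not local or not dedupe:
        local.update(incoming)
        return local
    existing_texts = set(local.values())  # content-based duplicate check
    for rid, text in incoming.items():
        if text not in existing_texts:
            local[rid] = text
    return local
```

For example, merging `{"b": "milk", "c": "eggs"}` into a store already holding `{"a": "milk"}` with `dedupe=True` would add only the "eggs" record.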
Both are doable. But now I'm thinking to myself: what if the user intentionally wants duplicated entries, to make modifications later? So I finally arrive at a conclusion: give users a choice. "Same entries found. Continue adding them as new, or cancel?"
So my question is: is this unnecessary complexity? Should I even be bothering with this? It bothers me because I don't like having these duplicates; I use this app myself, and I don't like deleting them later.
How did the Dropbox engineers handle sync (local datastore transfer, etc.)?
And on a side note: is MD5 still fast enough for a simple text check, or should I look at SHA or something?
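For what it's worth, here's a small sketch of what that check might look like with Python's stdlib `hashlib` (the function name and sample text are my own). For non-security duplicate detection, MD5 is still fast and fine; SHA-256 is barely slower on short strings and avoids collision worries. And for short texts, comparing the strings directly may be cheaper than hashing at all:

```python
import hashlib

def text_digest(text, algo="sha256"):
    """Hash a record's text for a quick equality check.

    This is for duplicate detection, not security, so either
    algorithm works; equal texts always produce equal digests.
    """
    return hashlib.new(algo, text.encode("utf-8")).hexdigest()

text_digest("buy milk", "md5")      # 32 hex chars
text_digest("buy milk", "sha256")   # 64 hex chars
```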
Thanks