Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: RE: string variables

From	Estrella Gomez <[email protected]>
To	[email protected]
Subject	Re: st: RE: string variables
Date	Fri, 20 Sep 2013 15:51:29 +0200

The problem is that I cannot split my dataset into two parts (the
original version of the movie and the rest) since the ids are mixed.
This is another example:

itunes_id | country | artist | trackname
647263958 | it | Adam Brooks | Certamente, Forse
647263958 | bg | Adam Brooks | Definitely, Maybe
281009584 | cy | Adam Brooks | Definitely, Maybe

Here the first two rows have the same id but different titles, and the
last two rows have the same title but different ids. This is because
a translated movie sometimes has the same title as the original.
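
One possible way to handle this overlap is to treat two rows as the same movie whenever they share an id or share an artist and title, and then to merge everything that is linked through a chain of such matches. The sketch below is in Python rather than Stata, and it is only an illustration under assumptions: the function name `assign_movie_ids`, the tuple layout, and the choice to match titles exactly (ignoring case) are all hypothetical, and exact title matching could wrongly merge two different movies by the same artist that happen to share a title.

```python
def assign_movie_ids(records):
    """Return one group number per record.

    Records are (itunes_id, country, artist, trackname) tuples.
    Two records land in the same group when they are linked, directly or
    through other records, by a shared itunes_id or a shared (artist, title).
    """
    parent = list(range(len(records)))  # union-find forest, one node per row

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # linking key -> index of the first row that had it
    for idx, (itunes_id, _country, artist, title) in enumerate(records):
        keys = (
            ("id", itunes_id),
            ("title", artist.casefold(), title.casefold()),
        )
        for key in keys:
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    # Number the groups 1, 2, 3, ... in order of first appearance.
    group_of_root = {}
    return [
        group_of_root.setdefault(find(idx), len(group_of_root) + 1)
        for idx in range(len(records))
    ]


records = [
    (647263958, "it", "Adam Brooks", "Certamente, Forse"),
    (647263958, "bg", "Adam Brooks", "Definitely, Maybe"),
    (281009584, "cy", "Adam Brooks", "Definitely, Maybe"),
]
movie_ids = assign_movie_ids(records)
print(movie_ids)  # all three rows receive the same group number
```

The first two rows are linked by their id and the last two by their title, so all three end up in one group even though no single column identifies the movie.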

Thank you very much,
Estrella
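
The fuzzy matching suggested in the quoted reply can be illustrated with a small Python sketch that uses the standard-library `difflib` module; it is not one of the Stata programs named there. The titles are taken from the examples in this thread. The helper names and the cutoff of 0.6 are hypothetical choices for illustration, and any real cutoff would need tuning by inspecting the matches. A character-based score like this cannot link titles that are translated in full, such as "Certamente, Forse" and "Definitely, Maybe".

```python
from difflib import SequenceMatcher
from itertools import combinations


def title_similarity(a, b):
    """Return a similarity score between 0 and 1 for two titles, ignoring case."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def similar_pairs(titles, threshold):
    """Return index pairs (i, j) with i < j whose titles score at or above threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(titles), 2)
        if title_similarity(a, b) >= threshold
    ]


titles = [
    "Anchorman - Die Legende von Ron Burgundy",
    "Anchorman: La leggenda di Ron Burgundy",
    "Anchorman: The Legend of Ron Burgundy",
    "Definitely, Maybe",
]

THRESHOLD = 0.6  # hypothetical cutoff, chosen only for this illustration

for i, j in similar_pairs(titles, THRESHOLD):
    score = title_similarity(titles[i], titles[j])
    print(f"{score:.2f}  {titles[i]!r}  ~  {titles[j]!r}")
```

Pairs that pass the cutoff are only candidates for being the same movie; they could then be restricted to rows with the same artist before a new id is assigned.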

2013/9/20 Joe Canner <[email protected]>:
> Estrella,
>
> I wouldn't assume that the first -n- characters of a movie title are always going to be the same in different languages.  That works for the example you provided, but there will probably be many exceptions.
>
> What you really need--and even this won't work in all cases--is "fuzzy" matching, akin to what is used, for example, by businesses to match the address you enter with a standard address in a database, or when trying to match patient information with a death index.
>
> There are at least two user-written programs for things like this: -reclink- and -vmatch-.  I haven't used them much, so I can't say exactly how you would use them in your situation.  If you get stuck on how to manipulate your data to get it into the right structure, let us know.
>
> Of course, the best solution would be an interface with Google Translate, as there is with Google Maps.  I did a quick search and couldn't find anything like this, although it seems like it would be very useful in certain situations.  On the other hand, even if there were such a thing, you would end up with the opposite problem: some words would get translated that should not be (e.g., "Anchorman" in your example).
>
> Good luck!
>
> Joe Canner
> Johns Hopkins University School of Medicine
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Estrella Gomez
> Sent: Friday, September 20, 2013 6:12 AM
> To: [email protected]
> Subject: st: string variables
>
> Dear Statalisters,
>
> I am working on a dataset related to movies. I would like to identify each movie with a unique id. However, in many cases the title is translated, and the identifier provided in the dataset then differs across versions of the same movie. For instance:
>
> id | country | artist | trackname
> 2975 | at | Adam McKay | Anchorman - Die Legende von Ron Burgundy
> 2975 | de | Adam McKay | Anchorman - Die Legende von Ron Burgundy
> 6647 | it | Adam McKay | Anchorman: La leggenda di Ron Burgundy
> 6653 | be | Adam McKay | Anchorman: The Legend of Ron Burgundy
>
> How could I create a new id that uniquely identifies the same movie, even when its title is in different languages? Maybe I could use the first 5 or 6 letters of the title, because these usually coincide across languages, but I still don't know how to do it.
>
> Thanks a lot,
> Estrella Gomez
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/

Follow-Ups:
- Re: st: RE: string variables
  - From: Robert Picard <[email protected]>
- RE: st: RE: string variables
  - From: Joe Canner <[email protected]>

References:
- st: string variables
  - From: Estrella Gomez <[email protected]>
- st: RE: string variables
  - From: Joe Canner <[email protected]>