Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Dropping Alphanumeric elements from variables
From
Nick Cox <[email protected]>
To
[email protected]
Subject
Re: st: Dropping Alphanumeric elements from variables
Date
Thu, 7 Feb 2013 14:43:43 +0000
That is, it sounds as if
egen newid = group(a_match1 n_match2), label
could work well for you. For more explanation, please see the 2007
paper whose URL is given below.
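Here is a minimal sketch of that suggestion, assuming the -moss- calls quoted below have been run (-moss- is from SSC) and that all two-digit years belong to the 2000s. The names year, village and hhid are made up for illustration.

```stata
* Sketch only: assumes -moss- is installed (ssc install moss)
clear
input str11 id
"BTG09A00001"
"BTG10A00001"
"BGM09A00027"
"BGM10A00027"
end

* Split each identifier into numeric and alphabetic runs
moss id, match("([0-9]+)") regex prefix(n_)
moss id, match("([A-Z]+)") regex prefix(a_)

* One integer per village-household combination, labelled with the pieces
egen newid = group(a_match1 n_match2), label

* Assumption: two-digit years are all in the 2000s
gen year = 2000 + real(n_match1)
xtset newid year

* Optional: an 8-digit identifier in the layout asked for.
* group() numbers villages in sorted order, so BGM gets 101 and BTG 102,
* which need not match any numbering you had in mind.
egen village = group(a_match1)
gen long hhid = (100 + village) * 100000 + real(n_match2)
format hhid %8.0f
```

The -long- storage type matters in the last step because 8-digit integers are near the limit of what a -float- holds exactly.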
Nick
On Thu, Feb 7, 2013 at 2:27 PM, Nick Cox <[email protected]> wrote:
> You could parse your identifiers using -substr()- to split out parts.
> I've found many times that people underestimate the possibilities of
> the very simplest string functions. A 2011 paper gives a tutorial on
> functions that are often neglected:
>
> http://www.stata-journal.com/article.html?article=dm0058
>
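> A minimal sketch of the -substr()- route, assuming every identifier
> has exactly the layout of your examples (3 letters, 2-digit year, 1
> letter, 5 digits). The new variable names are made up for
> illustration.
>
> ```stata
> clear
> input str11 id
> "BTG09A00001"
> "BTG10A00001"
> "BGM09A00027"
> "BGM10A00027"
> end
>
> * Fixed positions: characters 1-3 village, 4-5 year, 7-11 household
> gen village = substr(id, 1, 3)
> gen yy      = substr(id, 4, 2)
> gen hh      = substr(id, 7, 5)
>
> list id village yy hh
> ```
>
> If any identifier departs from that fixed layout, the positions above
> would pick up the wrong characters, so check the lengths first.
>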
> Or you could use regular expression tools. -moss- from SSC could work
> with examples like this:
>
> . l
>
>      +-------------+
>      |          id |
>      |-------------|
>   1. | BTG09A00001 |
>   2. | BTG10A00001 |
>   3. | BGM09A00027 |
>   4. | BGM10A00027 |
>      +-------------+
>
> . moss id, match("([0-9]+)") regex prefix(n_)
>
> . moss id, match("([A-Z]+)") regex prefix(a_)
>
> . l id *match*
>
>      +---------------------------------------------------------+
>      |          id   n_match1   n_match2   a_match1   a_match2 |
>      |---------------------------------------------------------|
>   1. | BTG09A00001         09      00001        BTG          A |
>   2. | BTG10A00001         10      00001        BTG          A |
>   3. | BGM09A00027         09      00027        BGM          A |
>   4. | BGM10A00027         10      00027        BGM          A |
>      +---------------------------------------------------------+
>
> That has split the identifiers into alphabetic and numeric sequences.
> I took your examples literally in producing these commands. In your
> case, you don't care about the result in -a_match2-, but I left it in
> above to show that -moss- can split out two or more components, not
> just one as is typical of calls to -substr()-.
>
> That said, Stata makes it easy to create identifiers that will work
> well across Stata's commands. Do-it-yourself identifiers can just make
> tables and graphs unwieldy.
>
> For a 2007 review, see
>
> http://www.stata-journal.com/article.html?article=dm0034
>
> The .pdf for that is accessible to all at
>
> http://www.stata-journal.com/sjpdf.html?articlenum=dm0034
>
> Nick
>
> On Thu, Feb 7, 2013 at 1:37 PM, Michler, Jeffrey D <[email protected]> wrote:
>
>> I have a dataset which includes household ID variables in an alphanumeric format. The letters are abbreviations of the village a household comes from. In addition to being in an alphanumeric format, the HH ID has a year element so that the HH ID for 2010 is slightly different than it was for 2009. I am looking to convert the alphanumeric HH id into a unique id for constructing a panel. I need to replace the 3 letter village abbreviations with a 3 digit number plus I need to drop the year id.
>>
>> An example may clarify. Right now HH IDs look like BTG09A00001, BTG10A00001, BGM09A00027, BGM10A00027.
>>
>> I want to replace the village code (BTG, BGM) with a numerical sequence. I also want to drop the year sequence (09, 10) so that HH ID is consistent for the HH across years, and I want to drop the A, which plays no role in my dataset. Ideally, this would compress the 4 HH IDs I gave as examples into just 2 IDs that would look like 10100001 and 10200027.
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/