In addition, "space" for -egen, sieve()- means " " and
doesn't include any characters that just print as spaces because
they're otherwise unprintable.
As always, you may be better off knitting a solution
to exactly the same size as your problem.
Given a string variable -badid-, let's suppose we regard
the legitimate characters as a-z, A-Z and 0-9.

    gen goodid = ""
    gen length = length(badid)
    su length, meanonly
    qui forval i = 1/`r(max)' {
        replace goodid = goodid + substr(badid,`i',1) ///
            if inrange(substr(badid,`i',1),"a","z") | ///
               inrange(substr(badid,`i',1),"A","Z") | ///
               inrange(substr(badid,`i',1),"0","9")
    }
    drop length

The recipe appears fairly general: just tune the -if-
condition. My guess is that
stuff you want to keep will always be printable and
fall into a few small classes.
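
As a sketch of what tuning the -if- condition might look like, suppose hyphens should be kept as well. The toy data, the variable name -len- and the choice to keep "-" are assumptions made for illustration only; the third id gets a trailing char(160) to mimic the HEX(A0) problem discussed in this thread. Run it from a do-file, since it uses /// continuation.

```stata
* Toy data (made up): row 2 holds a stray space, row 3 a trailing char(160)
clear
input str12 badid
"ABI000000-01"
"AHR 360227"
"ALB431118"
end
replace badid = badid + char(160) in 3

* Same loop as above, with one extra clause that keeps hyphens
gen goodid = ""
gen len = length(badid)
su len, meanonly
qui forval i = 1/`r(max)' {
    replace goodid = goodid + substr(badid,`i',1) ///
        if inrange(substr(badid,`i',1),"a","z") | ///
           inrange(substr(badid,`i',1),"A","Z") | ///
           inrange(substr(badid,`i',1),"0","9") | ///
           substr(badid,`i',1) == "-"
}
drop len

list badid goodid
```

Each further class of characters to keep is one more clause joined with | in the -if- condition.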
Two small morals are that we do not need to fool around
with -char()- or its elusive inverse -ascii()-,
and that -inrange()- applies to strings too.
. di inrange("Bush","Lincoln","Roosevelt")
0
Isn't Stata well-informed as well as smart?
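
To see why -inrange()- is enough for the loop above, here is a small sketch of how Stata orders single characters: comparison goes by character code, so digits sort before uppercase letters, which sort before lowercase letters. The test values are made up for illustration, and the // comments assume the lines are run from a do-file.

```stata
di inrange("b","a","z")    // 1: lowercase letter within a-z
di inrange("B","a","z")    // 0: "B" sorts before "a"
di inrange("5","0","9")    // 1: digit within 0-9
di inrange(" ","0","9")    // 0: space sorts before "0"
```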
Nick
[email protected]
Nick Cox
> -omit(space)- confuses syntaxes and will not
> do what you think it will. It omits "s", "p", etc.
Fred Wolfe
> > That is a great egen. But it doesn't seem to work completely to omit
> > HEX(A0), unless I have done something wrong. Always likely.
> >
> >
> > . use fwbids,clear
> > . egen apatkey2 = sieve(apatkey), keep(a n o)
> > . gen l1 = length(apatkey)
> > . gen l2 = length(apatkey2)
> >
> > . egen apatkey3 = sieve(apatkey2), omit(space)
> > . gen l3 = length(apatkey3)
> >
> > . egen apatkey4 = sieve(apatkey3), keep(a n)
> > . gen l4 = length(apatkey4)
> >
> >
> >     +-------------------------------------------------------------------------------+
> >     | apatkey       greger  apatkey2      l1  l2  apatkey3      l3  apatkey4     l4 |
> >     |-------------------------------------------------------------------------------|
> >  1. | ABI000000-01       1  ABI000000-01  12  12  ABI000000-01  12  ABI00000001  11 |
> >  2. | AHR000000          1  AHR000000     12  11  AHR000000     11  AHR000000     9 |
> >  3. | AHR360227          1  AHR360227     12  11  AHR360227     11  AHR360227     9 |
> >  4. | ALB431118          1  ALB431118     12  11  ALB431118     11  ALB431118     9 |
> >  5. | ALD771122          1  ALD771122     12  11  ALD771122     11  ALD771122     9 |
> >     +-------------------------------------------------------------------------------+
> >
> >
> >
> >
> > At 10:13 AM 8/25/2006, Nick Cox wrote:
> > >"you" here presumably meaning Fred's collaborators.
> > >
> > >There is a home-grown -egen- function called -sieve()-
> > >in -egenmore- from SSC that could be used to keep
> > >alphanumeric characters only.
> > >
> > >Nick
> > >[email protected]
> > >
> > >Rafal Raciborski
> > >
> > > > you could also use the clean() function in excel first, which
> > > > removes all nonprintable characters, before pasting into stata.