
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


RE: st:Failure to detect strings that look completely identical

From	Nicola Man <[email protected]>
To	"[email protected]" <[email protected]>
Subject	RE: st:Failure to detect strings that look completely identical
Date	Wed, 23 Nov 2011 06:22:06 +0000

Yes, -charlist- diagnosed the problem: the culprit is char(160). You are also correct about -substr()-. Below is my solution, in case anyone is following this:

gen C1=substr(Country,1,1)
charlist C1 if Country!=Nation
*showed char(160) as the sole leading ascii character for all non-matches.
ret li

gen N1=substr(Nation,1,1)
charlist N1 if Country!=Nation
*showed letters as the leading characters.
ret li

*this fixed the problem
replace Country=substr(Country,2,.) if Country!=Nation
*came back with a sensible list
list Country Nation if Country!=Nation
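
A slightly more defensive variant of the fix above, offered only as a sketch. It conditions on the leading character itself rather than on Country!=Nation, so a row that differs for some other reason (for example "Antigua and Barbuda" versus "Antigua & Barbuda") loses its first character only if that character really is char(160). The three-row dataset is made up for illustration, and the code assumes the rogue character is the single byte char(160), as in Stata 12. In later Stata versions with UTF-8 strings, a non-breaking space read from a file may instead be the two-byte uchar(160), in which case this would need adapting.

```stata
* toy data, made up for illustration
clear
input str20 Nation
"Afghanistan"
"Albania"
"Nauru"
end
gen Country = Nation
replace Country = char(160) + Country in 1/2

* the first two rows fail to match because of the leading char(160)
count if Country != Nation

* strip the first character only where it really is char(160)
replace Country = substr(Country, 2, .) if substr(Country, 1, 1) == char(160)

* stops with an error if any mismatch remains
assert Country == Nation
```

If char(160) could also occur in the middle or at the end of a string, -subinstr()- with char(160) removes every occurrence instead of just the first character.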

Thanks again for the help.

Regards,
Nicola
------------------------
The help for -charlist- (SSC) documents that char(32) and char(160)
are hard to tell apart:

. di "|`=char(32)'|"
| |

. di "|`=char(160)'|"
| |

So, watch out for char(160).
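
For anyone who wants to reproduce this without -charlist-, here is a minimal sketch using the auto data. It plants char(160) in front of a copy of make, shows that -trim()- does not remove it (trim() strips ordinary blanks, char(32), only), locates it with -strpos()-, and removes it with -subinstr()-. The variable name make2 is just for illustration, and the sketch assumes char(160) is stored as a single byte.

```stata
sysuse auto, clear
gen make2 = char(160) + make

* trim() leaves char(160) in place, so this counts no matches
count if trim(make2) == make

* strpos() returns a positive position wherever char(160) occurs
count if strpos(make2, char(160)) > 0

* remove every char(160), then confirm the two variables agree
replace make2 = subinstr(make2, char(160), "", .)
assert make2 == make
```

The final -assert- is silent when the variables agree and stops with an error otherwise, which makes it a handy check after any cleaning step of this kind.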

Nick

On Tue, Nov 22, 2011 at 12:33 PM, Nick Cox <[email protected]> wrote:
> I didn't suspect otherwise, but I have now confirmed that in Stata 12 too -substr()- will not allow a single argument.
>
> Nick
> [email protected]
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Nick Cox
> Sent: 22 November 2011 08:52
> To: [email protected]
> Subject: Re: st: Failure to detect strings that look completely identical
>
> To mention a side-issue first, I don't know what your -substr()- calls
> are doing. I suspect you have misquoted them. Does -substr()- indulge
> a single argument in Stata 12? I am away from Stata 12 at present.
>
> On the main point: This looks to me as if you have embedded tab
> characters, or some other elusive characters.
>
> More generally, a utility -charlist- was written for similar problems
> and may be downloaded from SSC. Pay attention to its returned results.
>
> For example, if I introduce a tab character I can detect it again with
> -charlist-
>
> sysuse auto
> gen make2 = char(9) + make
> l make make2 in 1/10
>
>     +--------------------------------+
>     | make                     make2 |
>     |--------------------------------|
>  1. | AMC Concord              AMC Concord |
>  2. | AMC Pacer                AMC Pacer |
>  3. | AMC Spirit               AMC Spirit |
>  4. | Buick Century    Buick Century |
>  5. | Buick Electra    Buick Electra |
>     |--------------------------------|
>  6. | Buick LeSabre    Buick LeSabre |
>  7. | Buick Opel               Buick Opel |
>  8. | Buick Regal              Buick Regal |
>  9. | Buick Riviera    Buick Riviera |
>  10. | Buick Skylark    Buick Skylark |
>     +--------------------------------+
>
> . charlist make2 in 1/10
>         ABCELMOPRSabcdegiklnoprtuvy
>
> . ret li
>
> macros:
>              r(chars) : "       ABCELMOPRSabcdegiklnoprtuvy"
>           r(sepchars) : "         A B C E L M O P R S a b c d e g i k
> l n o p r t u v y "
>              r(ascii) : "9 32 65 66 67 69 76 77 79 80 82 83 97 98 99
> 100 101 103 105 107 108 .."
>
> Once you have identified the rogue characters, remove them using
> -subinstr()- together with the -char()- function.
>
> replace strvar = subinstr(strvar, char(9), "", .)
>
> Nick
>
> On Tue, Nov 22, 2011 at 2:46 AM, Nicola Man <[email protected]> wrote:
>>
>> I have a problem with matching strings in two string variables.
>>
>> *Generate C1 and N1 to show problem with detecting the first character in the Country variable
>> . gen C1=substr(trim(Country))
>> . gen N1=substr(trim(Nation))
>> . list C1 N1 Country Nation if trim(Country)!=trim(Nation)
>>
>>     +---------------------------------------------------------------------------+
>>     | C1   N1                           Country                          Nation |
>>     |---------------------------------------------------------------------------|
>>  1. |       A                       Afghanistan                     Afghanistan |
>>  2. |       A                           Albania                         Albania |
>>  3. |       A                           Algeria                         Algeria |
>>  4. |       A                           Andorra                         Andorra |
>>  5. |       A                            Angola                          Angola |
>>     |---------------------------------------------------------------------------|
>>  6. |       A               Antigua and Barbuda               Antigua & Barbuda |
>>  7. | etc..
>>
>> The first five observations of Country and Nation look identical to me, so I am not sure why Stata is not treating them as equal. The C1 variable tells me that the first character is not detected correctly, even with the trim() string function. Looking at it another way, only four records were identified as matching with the command below:
>>
>> . lis C1 N1 Country Nation if trim(Country)==trim(Nation)
>>
>>     +-----------------------------------------------+
>>     | C1   N1            Country             Nation |
>>     |-----------------------------------------------|
>> 109. |  M    M   Marshall Islands   Marshall Islands |
>> 121. |  N    N              Nauru              Nauru |
>> 166. |  T    T             Taiwan             Taiwan |
>> 176. |  T    T             Tuvalu             Tuvalu |
>>     +-----------------------------------------------+
>>
>> I am currently using Stata 12 / SE and not exactly sure if this is to do with the character coding system it uses (is it only ASCII?).  If it is to do with the character coding, then I would appreciate advice or suggestions on the way around it.

Thanks,
Nicola
