Re: st: Failure to detect strings that look completely identical
From: Nick Cox <[email protected]>
To: [email protected]
Subject: Re: st: Failure to detect strings that look completely identical
Date: Tue, 22 Nov 2011 08:52:06 +0000
To mention a side-issue first, I don't know what your -substr()- calls
are doing. I suspect you have misquoted them. Does -substr()- indulge
a single argument in Stata 12? I am away from Stata 12 at present.
On the main point: This looks to me as if you have embedded tab
characters, or some other elusive characters.
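To illustrate why -trim()- would not help here, the following is a minimal sketch on made-up data, not the poster's dataset. The variable names Country and Nation are borrowed from the question; the single observation and the leading tab are assumptions for illustration. As far as I know, -trim()- strips leading and trailing spaces only, so a tab survives it and becomes the first character that -substr()- picks up.

```stata
* Hypothetical one-observation dataset: Country carries a leading tab
clear
set obs 1
gen str20 Country = char(9) + "Albania"
gen str20 Nation  = "Albania"

* trim() removes leading and trailing spaces, not tabs,
* so the two values still differ after trimming
assert trim(Country) != trim(Nation)

* The first character of the trimmed value is the tab, not "A"
assert substr(trim(Country), 1, 1) == char(9)

* A length comparison exposes the extra character
display length(Country) - length(Nation)
```

Comparing -length()- of the two variables is a quick first check when two strings look the same on screen but fail an equality test.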
More generally, a utility -charlist- was written for similar problems
and may be downloaded from SSC. Pay attention to its returned results.
For example, if I introduce a tab character, I can detect it again with
-charlist-:
. sysuse auto
. gen make2 = char(9) + make
. l make make2 in 1/10
     +--------------------------------+
     | make            make2          |
     |--------------------------------|
  1. | AMC Concord     AMC Concord    |
  2. | AMC Pacer       AMC Pacer      |
  3. | AMC Spirit      AMC Spirit     |
  4. | Buick Century   Buick Century  |
  5. | Buick Electra   Buick Electra  |
     |--------------------------------|
  6. | Buick LeSabre   Buick LeSabre  |
  7. | Buick Opel      Buick Opel     |
  8. | Buick Regal     Buick Regal    |
  9. | Buick Riviera   Buick Riviera  |
 10. | Buick Skylark   Buick Skylark  |
     +--------------------------------+
. charlist make2 in 1/10
ABCELMOPRSabcdegiklnoprtuvy
. ret li
macros:
r(chars) : " ABCELMOPRSabcdegiklnoprtuvy"
r(sepchars) : " A B C E L M O P R S a b c d e g i k
l n o p r t u v y "
r(ascii) : "9 32 65 66 67 69 76 77 79 80 82 83 97 98 99
100 101 103 105 107 108 .."
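The returned results can also be scanned in code rather than by eye. This is only a sketch: it assumes -charlist- is installed from SSC and that r(ascii) holds the character codes as a space-separated list, as in the output above. The cutoffs 32 and 126 (the printable ASCII range) and the local macro names are my own choices for illustration.

```stata
* ssc install charlist      // one-time install, if not already present
sysuse auto, clear
gen make2 = char(9) + make
charlist make2

* Copy the returned codes before any other command overwrites r()
local codes "`r(ascii)'"

* Collect codes outside the printable ASCII range 32-126
local suspects
foreach code of local codes {
    if `code' < 32 | `code' > 126 {
        local suspects `suspects' `code'
    }
}
display "suspect character codes: `suspects'"
```

Any code that turns up in the suspects list is a candidate for removal.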
Once you have identified the rogue characters, remove them using
-subinstr()-, with the -char()- function specifying each character by
its ASCII code. For a tab:
replace strvar = subinstr(strvar, char(9), "", .)
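If more than one kind of character is involved, the same -replace- can be looped over a list of codes. A minimal sketch on the auto data: the list 9, 10, 13 (tab, line feed, carriage return) is an assumption for illustration, and you would substitute whatever codes -charlist- reports for your own variable.

```stata
sysuse auto, clear
* Plant a leading tab and a trailing carriage return for the demonstration
gen make2 = char(9) + make + char(13)

* Strip each unwanted character in turn
* (assumed list: 9 = tab, 10 = line feed, 13 = carriage return)
foreach code of numlist 9 10 13 {
    replace make2 = subinstr(make2, char(`code'), "", .)
}

* Check that the cleaned variable now matches the original
assert make2 == make
```

The final -assert- stops with an error if any observation still differs, which makes it a convenient check before merging or matching on the cleaned variable.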
Nick
On Tue, Nov 22, 2011 at 2:46 AM, Nicola Man <[email protected]> wrote:
>
> I have a problem with matching strings in two string variables.
>
> *Generate C1 and N1 to show problem with detecting the first character in the Country variable
> . gen C1=substr(trim(Country))
> . gen N1=substr(trim(Nation))
> . list C1 N1 Country Nation if trim(Country)!=trim(Nation)
>
>      +-----------------------------------------------------+
>      | C1   N1   Country               Nation              |
>      |-----------------------------------------------------|
>   1. |      A    Afghanistan           Afghanistan         |
>   2. |      A    Albania               Albania             |
>   3. |      A    Algeria               Algeria             |
>   4. |      A    Andorra               Andorra             |
>   5. |      A    Angola                Angola              |
>      |-----------------------------------------------------|
>   6. |      A    Antigua and Barbuda   Antigua & Barbuda   |
>   7. | etc..
>
> The first five lines of observations for Country and Nation look identical to me, so I am not sure why Stata
> is not detecting this. The C1 variable tells me that the first character is not detected correctly even with the trim string function. Looking at it in another way, there were only four records identified as matching with the command below:
>
> . lis C1 N1 Country Nation if trim(Country)==trim(Nation)
>
>      +-----------------------------------------------+
>      | C1   N1   Country            Nation           |
>      |-----------------------------------------------|
> 109. | M    M    Marshall Islands   Marshall Islands |
> 121. | N    N    Nauru              Nauru            |
> 166. | T    T    Taiwan             Taiwan           |
> 176. | T    T    Tuvalu             Tuvalu           |
>      +-----------------------------------------------+
>
> I am currently using Stata 12 / SE and not exactly sure if this is to do with the character coding system it uses (is it only ASCII?). If it is to do with the character coding, then I would appreciate advice or suggestions on the way around it.