I think this will necessarily involve some manual work and some familiarity with Stata's string functions. The word() function, which returns the nth space-delimited word of a string, will get you part of the way there:
. gen vend = word(vendor,1)
. table vend
--------------------------
         vend |      Freq.
--------------+-----------
      STRYKER |          4
STRYKERITALIA |          1
       STYKER |          1
       SULZER |         10
 SULZERMEDICA |          1
       ZIMMER |          6
--------------------------
.
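But as you can see, we also got STYKER, STRYKERITALIA, and SULZERMEDICA, because word() only splits on spaces: run-together names and misspellings each come through as their own value. The Stykers of the world (i.e., typos) are going to cause you the most trouble, since no general rule catches them and each one has to be found and recoded by hand. I view that kind of cleaning as a necessary part of data analysis.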
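For the leftovers, something along these lines might work. It is only a sketch under a few assumptions: that the raw names sit in a string variable called vendor (as in my example above), that firm is a free name for the new variable, and that your Stata has strpos() (older releases call the same function index()). Note that == does not understand wildcards, so a prefix test has to be written with a string function instead.

```stata
* Sketch only: vendor is assumed to hold the raw names;
* firm is a hypothetical name for the cleaned variable.
gen firm = ""

* Assign every name that BEGINS with a known stem. This also
* catches run-together forms such as STRYKERITALIA and SULZERMEDICA.
foreach stem in STRYKER SULZER ZIMMER {
    replace firm = "`stem'" if strpos(upper(trim(vendor)), "`stem'") == 1
}

* Typos need one hand-written rule each, e.g. STYKER for STRYKER.
replace firm = "STRYKER" if strpos(upper(trim(vendor)), "STYKER") == 1

* Anything still unassigned has to be inspected by hand.
list vendor if firm == ""
```

The final list is the manual part: each name it shows needs either a new stem in the loop or its own replace line. Testing for position 1 rather than "anywhere in the string" is deliberate, so that a stem appearing in the middle of some unrelated company's name is not swept into the wrong group.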
Eric
>Dear Statalist
>I have a dataset with a string variable named VEND. It contains many different companies under varied names, although often the names indicate the same company.
>For example, for three different firms:
>
>STRYKER ITALIA SRL
>STRYKER ITALIA SRL -
>STRYKER ITALIA SRL S
> STRYKER SRL
> STRYKERITALIA
> STYKER ITALIA SRL
> SULZER
> SULZER MEDICA
>SULZER OR ITALIA SPA
> SULZER ORTHOPEDICS
>SULZER ORTHOPEDICS I
> SULZER ORTHPEDICS
> SULZER ORTOPEDIC
>SULZER ORTOPEDICA IT
>SULZER ORTOPEDICS IT
> SULZER PROTEK
> SULZERMEDICA
> ZIMMER
> ZIMMER - NEX GEN
> ZIMMER ARL
> ZIMMER S.R.L.
>ZIMMER S.R.L. (C
> ZIMMER SRL
>
>
>
>Where the names are easily
>STRYKER
>SULZER
>ZIMMER
>
>How can I replace these strings with the same cluster name?
>Do you know if there is a command similar to
>. replace vend if vend=="zimmer***"
>or do I have to build a do-file with a lot of -substr- and -index- commands?
>Thanks in Advance
>Paolo Grillo
--
===================================================
Eric G. Wruck
Econalytics
2535 Sherwood Road
Columbus, OH 43209
ph: 614.231.5034
cell: 614.330.8846
eFax: 614.573.6639
eMail: [email protected]
website: http://www.econalytics.com
====================================================