I recall some fuss about this in _Nature_ and _Science_
a couple of years ago. I wrote a couple of programs
and then got bored, or distracted, before I wrote the help.
Anyway, both programs support -by:-.
------------------------------------ hindex.ado
*! NJC 1.0.0 21 Sept 2005
program hindex, byable(recall) rclass sort
version 8
syntax varname(numeric) [if] [in]
quietly {
marksample touse
count if `touse'
if r(N) == 0 error 2000
tempvar negvar rank
gen `negvar' = -`varlist'
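bysort `touse' (`negvar'): gen `rank' = _n
* ranks stay unique: tied values must not share a rank, or h is
* understated whenever a tie straddles h (2, 2, 2 has h = 2)
su `rank' if (`rank' <= `varlist') & `touse', meanonly
* no observation qualifies when every value is below 1: h is then 0
local h = cond(r(N) > 0, r(max), 0)
}
di _n as txt "h-index " as res %3.0f `h'
return scalar hindex = `h'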
end
----------------------------------
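As a sketch of how -hindex- might be called, assuming it is saved as hindex.ado somewhere on the adopath and that the variables are named as in Pierre's question below (scientist_id, nbcites; scientist_id is assumed to be a string variable):

```stata
* one h-index per scientist, using the by: prefix
bysort scientist_id: hindex nbcites

* a single scientist; the result is also left behind in r(hindex)
hindex nbcites if scientist_id == "GEORGE"
display r(hindex)
```

For the GEORGE data quoted below this should agree with Pierre's value of 19.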
---------------------------------- _ghindex.ado
*! 1.0.1 NJC 17 Oct 2005
*! 1.0.0 NJC 21 Sept 2005
program _ghindex
version 8
syntax newvarname =/exp [if] [in] [, BY(varlist) ]
marksample touse, novarlist
tempvar GRV
quietly {
gen double `GRV' = -(`exp') if `touse'
markout `touse' `GRV'
bysort `touse' `by' (`GRV'): gen `typlist' `varlist' = _n
* ranks stay unique: tied values must not share a rank, or h is
* understated whenever a tie straddles h
replace `varlist' = 0 if (`exp') < `varlist'
bysort `touse' `by' (`varlist'): ///
replace `varlist' = `varlist'[_N]
if "`by'" != "" local by " by `by'"
label var `varlist' "h-index of `exp'`by'"
}
end
---------------------------------
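A sketch of the -egen- route, assuming _ghindex.ado is on the adopath and again using the variable names from Pierre's question below; the new variable names h and tag are just for illustration:

```stata
* h-index for each scientist, repeated on every row of that scientist
egen h = hindex(nbcites), by(scientist_id)

* show one row per scientist
egen tag = tag(scientist_id)
list scientist_id h if tag
```

This form is the more convenient one when the h-index is wanted as a variable for later use, rather than as a displayed result.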
I also worked out this recipe for the h-index of -response-, by -byvar-:
* ranks are kept unique: tied values must not share a rank
bysort byvar : egen rank = rank(-response), unique
by byvar : egen hindex = max(rank) if response >= rank
tabdisp byvar if response >= rank, cell(hindex)
Also in 2005, for what it's worth, I sent this to Nature, but
it was rejected (meaning, it was not published, and no-one
replied).
-------------------------------------------------------
The h-index of Jorge Hirsch (http://xxx.arxiv.org/abs/physics/0508025;
Nature 436, 900; 2005) is the highest number of papers a scientist has
that have each received at least that number of citations. As a measure
of both productivity and impact, it is offered as an objective criterion
for decisions on, say, tenure, promotion and elections to distinguished
societies.
The index is so simple and elegant that one wonders why it has
apparently not been suggested before as a descriptive statistic.
However, application to other ranked counts, such as species abundance
data from ecology, suggests that its bibliometric success depends partly
on the coincidence that numbers of publications and of citations per
paper are often close to each other. This reflects both the sizes of
specialisms and various publication conventions.
On the evidence so far, mostly from physics, the h-index works rather
well, at least for comparing people in the same field who work in the
same way. But that is the nub of the matter. Can it acceptably rank
people in very different fields, even with a range of benchmarks? Like
any one-dimensional summary based on publications, it ducks the question
of how to assess outputs in other forms (patents, software, etc.).
Sciences vary considerably in how far publications are single-authored
or multi-authored or very short or much longer, so number of papers is a
dubious metric on those grounds alone. The fraction of past or even
current literature covered in major databases also should not be assumed
either high or constant across disciplines. Some scientists publish
sparsely but durably in small and unspectacular but nevertheless
fundamental fields (e.g. specialisms in systematic biology). The
h-index also undervalues the contribution of those who publish a few
outstandingly important papers (deep theorems such as that of
Fermat-Wiles, fundamental new discoveries, very widely used methods). If
scientists were to plan with the h-index in mind, the incentive to write
books or review papers would decrease markedly, as each could have only
a marginal effect compared with writing conventional papers. Science
could only suffer as a result.
-------------------------------------------
Nick
Pierre Azoulay wrote:
> I am trying to calculate the so-called h index for a large number of
> scientists. The h index of a scientist is the highest integer h such
> that the scientist has h papers cited at least h times.
>
> For example, for the scientist below, the h index is 19.
>
> scientist_id article_id nbcites
> GEORGE 10101157 8
> GEORGE 12242494 10
> GEORGE 11156976 12
> GEORGE 9409826 19
> GEORGE 7635312 23
> GEORGE 7799970 23
> GEORGE 11290701 28
> GEORGE 8034742 42
> GEORGE 8334302 43
> GEORGE 2656402 74
> GEORGE 2005819 79
> GEORGE 2643162 111
> GEORGE 8943317 127
> GEORGE 1956405 146
> GEORGE 9314530 153
> GEORGE 2404021 204
> GEORGE 3049620 302
> GEORGE 2195038 373
> GEORGE 2476649 393
> GEORGE 2005809 527
> GEORGE 6365931 614
> GEORGE 6365930 670
>
>
> I have written a program that calculates this for one scientist (see
> below). The problem is that I have a very large number of scientists,
> and so would like to combine the program below with "by scientist_id:"
>
> I am not sure exactly how to do that in stata. Could any one help?
>
> Thanks,
>
> Pierre
>
>
> gen h_index=.;
> local N = _N;
> forvalues i = 1(1)`N'
> {;
> display `i';
> replace h_index=`N'-`i'+1 if
> (nbcites[`i']>=`N'-`i'+1 & h_index==.);
> replace h_index=`N'-`i'+1 if (nbcites[`i']>=`N'-`i'+1 &
> h_index<`N'-`i'+1 & h_index!=.);
> };