
Eight clusters : a dynamic perspective and structural analysis for the evaluation of institutional research performance


Academic year: 2021



Thijs, B.

Citation

Thijs, B. (2010, January 27). Eight clusters : a dynamic perspective and structural analysis for the evaluation of institutional research performance. Retrieved from

https://hdl.handle.net/1887/14617

Version: Not Applicable (or Unknown)

License: Licence agreement concerning inclusion of doctoral thesis in the Institutional Repository of the University of Leiden

Downloaded from: https://hdl.handle.net/1887/14617

Note: To cite this publication please use the final published version (if applicable).


2 Classification model

One of the goals of the development of the classification model is to enhance comparability in evaluative studies of research institutes. As stated above, institutes of higher education do not have a monopoly on scientific research and are not the only organizations that receive public funding for their research activities.

Therefore this model is not limited to institutes of higher education but is applied to all organizations that publish research papers in scientific journals.

In this thesis the terms ‘institute’ and ‘institution’ are used interchangeably to indicate organizations that publish scientific research papers.

2.1 Address cleaning

One of the major tasks prior to the creation of the classification model is the assignment of papers to institutes. This is a crucial step in the complete analysis. If we were unable to assign publications to institutes at an acceptable level, any statement about these institutes’ research performance would be invalid. For this reason it is appropriate to give some insight into the procedures used to clean the data.

This project started in 2004 with the creation of the first version of a cleaning procedure. At this stage it was applied to a few European countries. In a next phase the methodology was applied to Brazilian institutes in joint work with Jaqueline Leta, resulting in two papers (Glänzel et al., 2006 and Leta et al., 2006).

The procedure was then extended to the EU-15 countries and some Eastern European countries. France and the UK proved difficult to handle.

2.1.1 Using raw data extracted from the WoS database

One of the advantages of the Web of Science (WoS) database is that the full address information of all addresses in the by-line of a paper is recorded. These addresses are stored twice in the database: first as a ‘full address’, just as it appears on the paper, and second in a parsed form, for which Thomson Scientific performs some additional parsing of the address information into subfields. The name of the institute, department, street, zip code, city, state and country are extracted from the full address. From 1998 onwards, additional information on the department (at three levels) is extracted from the street field. The first field in the parsed address is crucial in our cleaning procedure.

2.1.2 Three step procedure

A procedure in three steps was developed for the cleaning and assignment of raw addresses.

Step 1: A list of variations is created. All addresses from one country are selected. These addresses are used to create a list of all distinct occurrences in the institute field. We assume that this list contains all synonyms and spelling variants of the institutes. We also count how many times each of these distinct names occurs in the total set of addresses. Only those names with a reasonable frequency are selected for the next steps.

Fig. 1. Schematic overview of the three-step procedure

Step 2: A thesaurus and a list of unique institutes are created. Each name from the list created in step 1 is, if possible, linked to an entry in the list of unique institutes. Each entry in the unique list also gets an identification number. If a name in the variation list cannot be assigned, internet search engines like ‘Google’ or ‘Live Search’ are used to try to find the corresponding institute and its official name. If needed, the institute is added to the unique list and the thesaurus is adapted.

Step 3: Addresses are matched with the thesaurus. All addresses from the selected country are matched against the thesaurus. Here not only the institute field in the parsed address entry is used; the three department fields and the street field are also included in the cleaning procedure. Figure 1 gives a schematic overview of these three steps.
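The three steps can be illustrated with a minimal Python sketch. It is not the procedure actually used in this project: the record layout (`institute`, `departments`, `street`), the normalization rule, the frequency cut-off and the example addresses are all assumptions made for illustration, and step 2 is represented by a hand-written thesaurus because that step is manual.

```python
from collections import Counter

def normalize(name):
    """Uppercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.upper().split())

def build_variant_list(addresses, min_count=2):
    """Step 1: count distinct institute-field values, keep the frequent ones."""
    counts = Counter(normalize(a["institute"]) for a in addresses)
    return {name: n for name, n in counts.items() if n >= min_count}

def match_address(address, thesaurus):
    """Step 3: try the institute field, then the department and street fields."""
    fields = [address.get("institute", "")]
    fields += address.get("departments", [])
    fields.append(address.get("street", ""))
    for value in fields:
        key = normalize(value)
        if key in thesaurus:
            return thesaurus[key]
    return None  # address stays unassigned

# Invented parsed addresses; the field names are assumptions.
addresses = [
    {"institute": "Katholieke Univ Leuven", "departments": [], "street": ""},
    {"institute": "KU Leuven", "departments": [], "street": ""},
    {"institute": "Katholieke  Univ Leuven", "departments": [], "street": ""},
    {"institute": "KU Leuven", "departments": ["Dept Phys"], "street": ""},
    {"institute": "Univ Hosp", "departments": ["KU Leuven"], "street": ""},
    {"institute": "J Smith Consulting", "departments": [], "street": "Main St 1"},
]

variants = build_variant_list(addresses, min_count=2)

# Step 2 (manual): link each variant to an entry in the list of unique institutes.
unique_institutes = {1: "K.U.Leuven"}
thesaurus = {"KATHOLIEKE UNIV LEUVEN": 1, "KU LEUVEN": 1}

assigned = [match_address(a, thesaurus) for a in addresses]
share_assigned = sum(x is not None for x in assigned) / len(addresses)
print(variants, assigned, round(share_assigned, 3))
```

In this toy set the fifth address is only recovered through its department field, which is the reason for including the department and street fields in step 3.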

2.1.3 Results of procedure

Table 1 gives the result of the cleaning procedure for 15 selected European countries. For all countries except Luxemburg, more than 70% of all addresses were assigned to an institution. For many countries the share of assigned addresses is even above 85%. From our experience with Belgian addresses, where 98.7% were assigned, we conclude that the remaining addresses often belong to small organizations, private persons or companies that publish only a few papers.

Country        Assigned addresses   Number of institutes
Austria        87.8%                 71
Belgium        98.7%                585
Denmark        92.8%                 77
Finland        95.5%                 91
France         71.9%                249
Germany        75.9%                206
Ireland        93.0%                 58
Italy          88.0%                291
Luxemburg      66.0%                 19
Netherlands    88.4%                149
Portugal       90.1%                 55
Spain          83.9%                266
Sweden         93.1%                124
Switzerland    85.2%                 78
UK             87.9%                456

Table 1. Results of address cleaning procedure in 15 European countries

Of course, when publication data is used for funding or evaluation of individual institutes, one should aim at collecting all publications of an institution.

2.2 Research profiles

Once the papers of the individual institutes were identified, the next step can be taken. For each of the selected institutes a research profile is constructed.

As we do not need a fine-grained subject structure, we use the 16 major fields in the natural and life sciences (13 fields), social sciences (2 fields) and humanities (1 field) as developed in Leuven and Budapest (Glänzel & Schubert, 2003).

A research profile can be seen as a vector in the ‘field-space’ representing the share of each of the 16 fields in the total set of publications of the specific institute.

                                                      Institut Français
                                                      de recherche pour
                                                      l’exploitation                 Karolinska
Field                                    K.U.Leuven   de la Mer           TU Delft   Institutet
Agriculture                               6%          29%                  8%         1%
Biology                                   9%          45%                  4%         7%
BioSciences                              14%          10%                  3%        18%
Biomedical Research                       8%           3%                  2%        14%
General & Internal Medicine              17%           3%                  0%        33%
Non-internal Medical Specialties         15%           1%                  2%        37%
Neuroscience & Behavior                   6%           0%                  1%        13%
Chemistry                                16%           8%                 33%         3%
Physics                                  15%           3%                 31%         1%
Geo & Space Sciences                      3%          31%                  8%         0%
Engineering                              10%           4%                 30%         1%
Mathematics                               4%           0%                  7%         0%
General, Regional & Community issues      3%           0%                  2%         2%
Economical and Political issues           3%           0%                  1%         0%
Arts & Humanities                         3%           0%                  1%         0%
Multidisciplinary                         1%           1%                  1%         2%

Table 2. Research profile of 4 different research institutes

These data are standardized and take only values between 0 and 1. As a result, the total number of papers an institute produces within a certain time frame has only a minor effect on its profile. However, this does not hold for institutes with a very low publication output, as an increase of activity in one field may have large consequences for both the share of this field and the shares of all the others. In order to keep the influence of these small institutes within reasonable limits, only institutes with a publication output above a given threshold are selected. In the final version of the classification model 1767 institutes were used to create groups of similar institutes.

Of course, this research profile can be calculated for any set of publications, by country, institute or author, and over any publication window. Table 2 gives the research profiles of 4 different research institutes.
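As an illustration of how such a profile vector and the publication threshold could be computed, consider the following minimal Python sketch. The shortened field list, the institute names and the paper assignments are invented. The sketch also assumes that a paper assigned to several fields counts once in each of them; this would be consistent with the shares of an institute in Table 2 adding up to more than 100%, but it remains an assumption here.

```python
from collections import Counter

# Shortened, hypothetical field list; the thesis uses 16 major fields.
FIELDS = ["Agriculture", "Biology", "Chemistry", "Physics"]

def research_profile(papers, fields=FIELDS):
    """Share of each field in a set of papers.

    `papers` holds one list of field names per paper. A paper listed in
    several fields counts once in each (an assumption), so the shares
    can sum to more than 1.
    """
    counts = Counter(f for paper in papers for f in set(paper))
    return [counts[f] / len(papers) for f in fields]

def build_profiles(papers_by_institute, min_papers):
    """Profiles for institutes at or above the publication threshold."""
    return {
        name: research_profile(papers)
        for name, papers in papers_by_institute.items()
        if len(papers) >= min_papers
    }

# Invented example data.
papers_by_institute = {
    "Institute A": [["Chemistry"], ["Chemistry", "Physics"], ["Physics"], ["Agriculture"]],
    "Institute B": [["Biology"]],  # too few papers, dropped by the threshold
}
profiles = build_profiles(papers_by_institute, min_papers=2)
print(profiles)
```

The same `research_profile` function applies unchanged to the papers of a country or an author, since it only needs a set of papers with their field assignments.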

In the first version of the classification model, described in the paper about self-citations at the meso level (see Part II, chapter 1), a principal component analysis (PCA) was performed in order to reduce the number of variables or fields.

This resulted in 9 components accounting for 76.7% of the total variance in the data. After further analysis (rotation) of these components, scores were calculated for all institutes. However, in later studies this PCA was abandoned as it did not have any added value for the analysis. It only reduced the available information that could be used for the clustering.

We also tried to use a finer classification system (60 subfields, or even the 200 WoS journal categories), but this often resulted in vectors containing too many zeros, which needlessly complicated the clustering.

2.3 Clustering

In order to create groups of institutes, a cluster analysis is most appropriate.

A more difficult choice concerns the methodology, algorithm, distance measure and final number of clusters. Several different models were used and tested. As mentioned above, a first version used the data after PCA. Then a second version of the model was used in the paper on Brazilian institutions (Leta et al., 2006). These models were then adapted and improved. Proper testing and validation was done and the final version was published in Scientometrics in 2008.

Because hierarchical clustering was chosen, the first model can be derived from the final one by aggregating at a higher level. However, the ‘stopping rules’ applied in the cluster algorithm suggest an eight-cluster solution instead of the six clusters used in the first version.

2.3.1 Hierarchical clustering

A hierarchical clustering methodology was used to build sets of similar institutes. The advantage of hierarchical clustering is that there is no need for an a priori specification of the number of clusters. Ward’s agglomerative clustering method was chosen in combination with squared Euclidean distances between institutes. Ward (1963) proposed a clustering algorithm which forms groups of items by minimizing the loss associated with each possible grouping.

Ward defined loss in terms of variance, that is, in terms of the sum of squared errors. At each step, this is calculated for each possible combination of available groups and items. The grouping with the smallest sum of squares is the next grouping. One criticism that can be formulated of Ward’s method is that the algorithm tends to create one large group and several smaller ones. This is also the case with the clustering of institutes. However, extensive validation was done and Ward’s algorithm gave the most satisfying results.
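The merging rule described above can be made concrete with a small Python sketch. It is a naive illustration of Ward’s criterion and not the software used for this study: at each step it merges the two groups whose union gives the smallest increase in the within-group sum of squared errors, which for two groups equals n_a·n_b/(n_a+n_b) times the squared Euclidean distance between their centroids. The five three-field profiles are invented.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def centroid(points):
    """Component-wise mean of a list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def ward_cost(a, b):
    """Increase in the within-group sum of squares if groups a and b merge."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * sq_dist(centroid(a), centroid(b))

def ward_clustering(profiles, n_clusters):
    """Agglomerate until n_clusters groups remain; returns lists of indices."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a = [profiles[k] for k in clusters[i]]
                b = [profiles[k] for k in clusters[j]]
                cost = ward_cost(a, b)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return [sorted(c) for c in clusters]

# Invented profiles over three fields.
profiles = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.1, 0.1, 0.8],
]
groups = ward_clustering(profiles, n_clusters=3)
print(groups)
```

Because the procedure is hierarchical, stopping at a smaller `n_clusters` only merges groups found at a larger value, which is why an earlier, coarser solution can be derived from a finer one by aggregation.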

2.3.2 Number of clusters

Crucial in a cluster analysis is the choice of the number of clusters. With hierarchical clustering one has to decide at each step in the procedure whether or not to stop partitioning. Many authors have developed stopping rules for hierarchical clustering, and others have elaborated on the evaluation and comparison of these rules. Milligan and Cooper (1985) compared 30 different methods for the estimation of the optimal number of clusters.

Fig. 2. Results of two stopping rules

In the first and second version of the classification these statistical rules were not used; only inspection of different cluster solutions was applied to judge the best number of clusters.

In the final study, two methods rated best in the Milligan and Cooper study were applied to determine the number of clusters, namely the pseudo-F index according to Calinski and Harabasz (1974) and the Je(2)/Je(1) index introduced by Duda and Hart (1973). Large values of these indexes indicate distinct clustering. Figure 2 shows the results for different numbers of clusters (Duda-Hart’s Je(2)/Je(1) index is multiplied by 100). These results do not support one particular number of clusters. The two-cluster solution suggested by both the Je(2)/Je(1) and the pseudo-F index only distinguishes between medical and non-medical institutions. This rough classification is trivial and proved not to be useful for the grouping of institutes with a similar research profile. The Duda-Hart Je(2)/Je(1) index suggests eight clusters as the second optimum solution.
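To show what the first of these stopping rules measures, the sketch below computes the pseudo-F index of Calinski and Harabasz for a given partition: the between-group sum of squares divided by (k − 1), over the within-group sum of squares divided by (n − k), for n items in k groups. Only the pseudo-F index is covered, and the one-dimensional data points and the two partitions are invented for illustration.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def centroid(points):
    """Component-wise mean of a list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def pseudo_f(points, labels):
    """Calinski-Harabasz pseudo-F index for one partition of the points."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    n, k = len(points), len(groups)
    overall = centroid(points)
    between = sum(len(g) * sq_dist(centroid(g), overall) for g in groups.values())
    within = sum(sq_dist(p, centroid(g)) for g in groups.values() for p in g)
    return (between / (k - 1)) / (within / (n - k))

# Invented data: two tight pairs far apart.
points = [[0.0], [2.0], [10.0], [12.0]]
f_natural = pseudo_f(points, [0, 0, 1, 1])  # groups follow the two pairs
f_mixed = pseudo_f(points, [0, 1, 0, 1])    # groups cut across the pairs
print(f_natural, f_mixed)
```

The partition that follows the structure of the data scores far higher than the one that cuts across it, which is the sense in which large values indicate distinct clustering. In practice the index is computed for each candidate number of clusters and the resulting curve is inspected, as in Figure 2.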

2.3.3 Eight cluster solution

These eight clusters can be characterised by the average research profile of the members of the respective cluster: Biology (1. BIO), Agriculture (2. AGR), Geo- and Space sciences (4. GSS), Technical and Natural sciences (5. TNS), Chemistry (6. CHE), General and Research Medicine (7. GRM), Specialised Medical (8. SPM) and a cluster with institutes having a Multidisciplinary profile (3. MDS).

Field                                    BIO     AGR     MDS     GSS     TNS     CHE     GRM     SPM
Agriculture (A)                          20.0%   66.7%    6.0%    6.2%    4.9%    4.6%    0.2%    0.6%
Arts & Humanities (U)                     0.0%    0.0%    0.3%    0.0%    0.2%    0.1%    0.1%    0.1%
Biology (Z)                              57.2%   25.1%    9.6%    1.8%    2.3%    1.6%    4.1%    5.2%
Biomedical Research (R)                   7.2%    5.2%   11.2%    0.1%    1.5%    1.2%    9.3%    9.8%
Biosciences (B)                          14.6%    5.2%   15.5%    0.5%    2.6%    2.1%    7.0%    5.9%
Chemistry (C)                             5.5%   14.3%   16.8%    2.9%   27.8%   86.6%    0.6%    1.1%
Engineering (E)                           1.4%    5.1%    9.0%    4.0%   35.0%    7.3%    0.6%    0.7%
General & Internal medicine (I)           9.0%    1.2%   12.7%    0.2%    0.8%    0.1%   64.8%   31.2%
Geo- & Space science (G)                  7.1%    5.5%    4.5%   87.4%    7.7%    1.3%    0.0%    0.1%
Mathematics (H)                           0.8%    0.7%    5.7%    0.5%    6.9%    0.7%    0.1%    0.2%
Neuroscience & Behavior (N)               1.8%    0.5%    9.1%    0.0%    1.1%    0.1%    2.6%    4.8%
Non-internal medicine specialties (M)     7.7%    4.0%   16.8%    0.3%    2.5%    0.9%   33.5%   61.9%
Physics (P)                               1.4%    3.1%   10.5%    4.7%   37.0%   10.8%    0.4%    0.6%
Social Sciences I (S)                     0.6%    0.3%    1.7%    0.1%    0.4%    0.0%    0.7%    1.5%
Social Sciences II (O)                    0.3%    0.5%    1.3%    0.0%    0.3%    0.2%    0.0%    0.1%

Table 3. Research profile per cluster as expressed by the field representation of their research output

Table 3 presents the average research profile for each of the eight clusters. Activity higher than 15% is highlighted. These profiles show a distinct and clear specialization within each individual cluster. Clusters 4 and 6 (GSS and CHE, respectively) are characterized by an extremely high degree of specialization; almost 90% of their research activities are devoted to one single field each. The two medical clusters (GRM and SPM) show the same high specialization, which is, however, distributed over the two subject fields in ‘clinical and experimental medicine’; the 2/3–1/3 proportions of these distributions are both contrary and complementary. The composition of the other clusters is more multidisciplinary. The TNS cluster (5) comprises the natural and technical sciences. The clusters BIO and AGR show considerable activity in the fields ‘agriculture’ and ‘biology’, interestingly enough almost mirroring the contrary and complementary picture of GRM/SPM found in the life sciences. Finally, the third cluster (MDS) has been found to be a truly multidisciplinary cluster with activity in all science fields and less skewed publication distributions over fields; no field has a share higher than 20% here.

Social sciences and humanities make up only a very small share of the activity of the European institutions in all clusters.

The number of institutes assigned to each cluster is presented in table 4.

The clusters with multidisciplinary institutes and institutes with specialized medicine jointly comprise about 57% of all institutes. The clusters Biology, Technical and Natural Sciences, and General and Research Medicine still hold a reasonable share of institutes, while the three small clusters (Geo- and Space Sciences, Agriculture and Chemistry) each hold about 3% of all institutes. The existence of one large cluster is often an undesired effect of the chosen linkage method (Ward), but in this case inspection of the data clearly supports one larger multidisciplinary group.

Cluster                               Code   Counts   Share
Cluster 1 (Biology)                   BIO       146     8.3%
Cluster 2 (Agriculture)               AGR        59     3.3%
Cluster 3 (Multidisciplinary)         MDS       550    31.1%
Cluster 4 (Geo- & Space Science)      GSS        57     3.2%
Cluster 5 (Technical & Natural)       TNS       261    14.8%
Cluster 6 (Chemistry)                 CHE        51     2.9%
Cluster 7 (General & Research Med.)   GRM       182    10.3%
Cluster 8 (Specialised Medicine)      SPM       461    26.1%
Total                                          1767   100.0%

Table 4. Number of research institutes in each of the eight clusters

Some of the most typical members of each group are listed here:

‘Wageningen University and Research Centre’ in group BIO has about 40% of all publications assigned to the field “biology”. With 72% in agriculture, the ‘Danish Institute of Agricultural Sciences’ is a true representative of the AGR group. In the third cluster (MDS) most of the large European universities with many specialties are grouped. Examples are ‘Catholic University of Louvain (K.U.Leuven)’ and ‘LMU Munich’. Most of the national research councils are included in this cluster as well. In the fourth cluster (Geo- & Space Sciences) we find the Italian ‘National Institute for Astrophysics (INAF)’ and the Spanish ‘Astrophysics Institute of the Canary Islands (IAC)’. The French institute ‘CEA (Commissariat à l’Énergie Atomique)’ is one of the typical members of the group specialized in Technical and Natural Sciences; others are ‘Delft University of Technology’ and ‘IMEC’ in Belgium. In the Chemistry group we find ‘BASF AG’ and ‘Institut Français du Pétrole’ as well as the Dutch company ‘DSM’. Our grouping resulted in two clusters with a main focus on medical sciences. In the first medical group (GRM) the focus is more on general and research medicine. General hospitals make up a large part of this group, e.g. ‘Niguarda CA Hospital of Milan’. Other institutes in this group are the ‘European Institute of Oncology’ and the ‘Netherlands Cancer Institute and Antoni van Leeuwenhoek Hospital’. In the last cluster (SPM) we find several universities like the ‘Erasmus University Rotterdam’ or the ‘Medical University of Lübeck’. We also find specialized institutes like the ‘National institute for the rest and care of the elderly’ or the ‘Nuffield Orthopaedic Centre’ among the institutions with a specialized medical profile.

2.4 Classification model

After the clustering of 1767 institutes, a classification model was created for the assignment of other research institutes or other research profiles to one of the eight groups. The institutions in the cluster analysis are used as the training set.

The resulting model can then be applied to any other institution. For the creation of this model we used discriminant analysis. This technique uses linear functions (latent variables) of the predictive variables in the dataset, in this case the vector representing the research profile. Each function or canonical root classifies cases into one of two possible groups. By adding a function to the analysis it is possible to distinguish one more group. This means that for the classification into eight separate groups we need seven different linear functions. Discriminant analysis can be disturbed by outliers, but these cases were mostly removed by excluding institutes with fewer than 20 publications.

Function   Eigenvalue   % of Variance   Cumulative %   Canonical Cor.
1          7.572(a)     36.4             36.4          0.940
2          4.573(a)     22.0             58.4          0.906
3          4.290(a)     20.6             79.0          0.901
4          1.320(a)      6.3             85.3          0.754
5          1.179(a)      5.7             91.0          0.736
6          1.171(a)      5.6             96.6          0.734
7          0.700(a)      3.4            100.0          0.642

Table 5. Eigenvalues and statistical functions of the discriminant analysis

The statistics for these seven discriminant functions are presented in table 5. The eigenvalue indicates the importance of a function for the classification of cases into the given groups. The ‘% of variance’ statistic is the share that each function has in the total of the explained variance. In this case, the first three functions account for nearly 80% of all explained variance. The canonical correlation, finally, gives information on the association between the grouping by the discriminant function and the dependent variable. Each of these functions also has a significant Wilks’ Lambda value, which indicates that different groups indeed have different, discriminating mean values on this function.

Based on these functions a classification model is constructed using Fisher’s coefficients. This results in eight different linear functions, one for each group. For each observation the resulting value of each function can be calculated. The observation is assigned to the group connected to the function with the highest value. The exact coefficients of each classification function can be found in Appendix 1.

The main outcome of the discriminant analysis can be summarized as follows. The constructed model was used to reclassify the 1767 institutes into the 8 different groups; 93% of all institutes were assigned correctly. This ratio, the significant values of Wilks’ Lambda and the canonical correlations substantiate the predictive power of the model and justify its use as a classification tool.
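The assignment rule and the reclassification check can be sketched in a few lines of Python. The coefficients below are invented for a toy case with three fields and three groups; they are not the coefficients of Appendix 1, and the dictionary layout (`constant`, `coefficients`) is likewise an assumption. Each group has one linear function of the profile, and an observation goes to the group whose function yields the highest value.

```python
def classify(profile, functions):
    """Return the group whose linear function scores highest, plus all scores."""
    scores = {
        group: f["constant"] + sum(c * x for c, x in zip(f["coefficients"], profile))
        for group, f in functions.items()
    }
    return max(scores, key=scores.get), scores

def reclassification_rate(training_set, functions):
    """Share of training cases assigned back to their original group."""
    hits = sum(classify(profile, functions)[0] == group
               for profile, group in training_set)
    return hits / len(training_set)

# Invented classification functions (one per group), not those of Appendix 1.
functions = {
    "CHE": {"constant": -5.0, "coefficients": [10.0, 1.0, 1.0]},
    "GSS": {"constant": -5.0, "coefficients": [1.0, 10.0, 1.0]},
    "MDS": {"constant": -1.0, "coefficients": [3.0, 3.0, 3.0]},
}

# Invented training cases: (profile, group from the cluster analysis).
training_set = [
    ([0.90, 0.05, 0.05], "CHE"),
    ([0.05, 0.90, 0.05], "GSS"),
    ([0.30, 0.30, 0.40], "MDS"),
    ([0.50, 0.40, 0.10], "CHE"),  # borderline case, scored as MDS
]

group, scores = classify([0.90, 0.05, 0.05], functions)
rate = reclassification_rate(training_set, functions)
print(group, rate)
```

Applying the functions back to the cases they were derived from, as done here, is the kind of check behind the reported share of correctly assigned institutes; it says how well the functions reproduce the clustering, not how they perform on institutes outside the training set.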


References

Calinski, T., Harabasz, J. (1974), A dendrite method for cluster analysis. Communications in Statistics, 3, 1-27.

Duda, R.O., Hart, P.E. (1973), Pattern Classification and Scene Analysis. New York: Wiley.

Glänzel, W. (2001), National characteristics in international scientific co-authorship. Scientometrics, 51 (1), 69-115.

Glänzel, W., Schubert, A. (2003), A new classification scheme of science fields and subfields designed for scientometric evaluation purposes. Scientometrics, 56 (3), 357-367.

Glänzel, W., Leta, J., Thijs, B. (2006), Science in Brazil. Part 1: A macro-level comparative study. Scientometrics, 67 (1), 67-86.

Leta, J., Glänzel, W., Thijs, B. (2006), Science in Brazil. Part 2: Sectoral and institutional research profiles. Scientometrics, 67 (1), 87-105.

Milligan, G.W., Cooper, M.C. (1985), An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50 (2), 159-179.

Thijs, B., Glänzel, W. (2006), The influence of author self-citations on bibliometric meso-indicators. The case of European universities. Scientometrics, 66 (1), 71-80.

Ward, J.H. (1963), Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58 (301), 236-244.
