-
Notifications
You must be signed in to change notification settings - Fork 1
/
scrape_decisions_GFCC.Rmd
498 lines (302 loc) · 17.8 KB
/
scrape_decisions_GFCC.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
---
title: "Automatic Collection of Chamber Decisions of the German Federal Constitutional Court"
author: "Sebastian Sternberg"
date: "06/08/2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Load necessary packages. Scraping is mainly done using rvest.
```{r}
rm(list = ls())
library(rvest)
library(magrittr)
library(stringr)
require(stringi)
require(dplyr)
require(strex)
```
This extracts all the links to each individual Chamber decision in the first step. These links are then used to extact the information in the second step. This should take around 45 minutes to finish, depending on the sys.sleep chosen.
The starting site:
#https://www.bundesverfassungsgericht.de/SiteGlobals/Forms/Suche/Entscheidungensuche_Formular.html?gtp=5403124_list%253D1&submit=Senden&dateAfter=01.01.1998&dateBefore=31.12.2019&language_=de
7013 links are expected:
Iterate over the link:
```{r}
pages <- c(1:702) # there are 702 pages to scrape (702 pages with each 10 decisions), if search is narrowed down to decisions between 01.01.1998 and 31.12.2019 and language = German
# getting the links to the decisions
links <- c()
for (i in 1:length(pages)){
message("Getting decision ", i)
html <- read_html(paste("https://www.bundesverfassungsgericht.de/SiteGlobals/Forms/Suche/Entscheidungensuche_Formular.html?gtp=5403124_list%253D",
i, "&submit=Senden&dateAfter=01.01.1998&dateBefore=31.12.2019&language_=de",
sep = "")
)
href <- html %>% html_nodes(".relevance100+a") %>% html_attr("href")
links <- c(links, href)
message("taking a break of 1.0 seconds")
Sys.sleep(1.0)
}
links
save(links, file = "links_decisons.Rda")
```
Now we have the raw links. These need to be cleaned, such that we can call each link in the next step.
```{r}
# select the decisions' links
decisions <- links[grep("^SharedDocs/Entscheidungen/",links)] %>% strsplit(";") %>% sapply("[", 1)
# the links must look like this: https://www.bundesverfassungsgericht.de/SharedDocs/Entscheidungen/DE/2019/07/rs20190730_2bvr168514.html
#generate the correct link be adding the necessary prefix to it:
decisions <- gsub("SharedDocs/", "https://www.bundesverfassungsgericht.de/SharedDocs/", decisions)
```
In the second step, we want to collect information from the decision texts. For this, we loop over each decision link. On each of these pages with the decision texts, we obtain the required information using the respective html node path. The following shows how this works for one decision:
```{r}
# Testing for one text
case <- read_html("https://www.bundesverfassungsgericht.de/SharedDocs/Entscheidungen/DE/2019/05/qk20190524_1bvq004619.html")
header <- case %>% html_nodes("#navBreadcrumbs strong") %>% html_text()
cite <- case %>% html_nodes(".cite") %>% html_text()
chamber_senate <- case %>% html_nodes(".rr1") %>% html_text()
judge1 <- case %>% html_nodes(".absatz:nth-child(8) span") %>% html_text()
judge2 <- case %>% html_nodes(".absatz:nth-child(9) span") %>% html_text()
judge3 <- case %>% html_nodes(".absatz:nth-child(10) span") %>% html_text()
link_decision <- case %>% html_nodes(".cite a") %>% html_text()
#http:https://www.residualthoughts.com/2018/03/28/topic-analysis-of-tim-ferris-podcast-using-rvest/
```
Now we do this for all the decisions. If you have access to a VPN, it might help to avoid connection losses during the scraping process.
```{r}
list_all <- list()
for(i in 1:length(decisions)){
message("Scraping decision ", i)
case <- read_html(decisions[i])
header <- case %>% html_nodes("#navBreadcrumbs strong") %>% html_text()
cite <- case %>% html_nodes(".cite") %>% html_text()
chamber_senate <- case %>% html_nodes(".rr1") %>% html_text() %>% str_c(collapse = " ")
judges_names <- case %>% html_nodes(".rr2 span") %>% html_text()
judges_names1 <- case %>% html_nodes(".rr2") %>% html_text() %>% strsplit("\n") %>% unlist() %>% trimws() %>% stri_remove_empty() #some of the names have empty/trailing whitespace which must be first removed to assign the names of the judges correctly
if(length(judges_names) == 0){
judge1 <- judges_names1[1]
judge2 <- judges_names1[2]
judge3 <- judges_names1[3]
judges_names <- str_c(judges_names1, collapse = " ")
} else{
judge1 <- judges_names[1]
judge2 <- judges_names[2]
judge3 <- judges_names[3]
judges_names <- str_c(judges_names, collapse = " ")
}
if(length(judges_names1) == 0){ #if neither name at html position of judgename and judgename1, write NA
judges_names <- "Not found in header"
judge1 <- NA
judge2 <- NA
judge3 <- NA
}
#if there is no chamber and senate write Not found
if(length(chamber_senate) == 0){
chamber_senate <- "Not found"
}
#get link to decision
link_decision <- case %>% html_nodes(".cite a") %>% html_text()
#do not extract the aktenzeichen; this can be done via the header later; the reason is that the AZ link is error prone if multiple of them are collected together
#aktenzeichen <- case %>% html_nodes(".az2 span") %>% html_text()
#aktenzeichen <- aktenzeichen[1] #extract only the first AZ; this is also the one shown in the link of a decision
list_all[[i]] <- data.frame(header = header, cite = cite, chamber_senate = chamber_senate,
judges_names = judges_names,
judge1 = judge1, judge2 = judge2, judge3 = judge3,
link_decision = link_decision)
#Display message
message("Iteration ", i, " of ", length(decisions), " successful.")
message("taking a break")
Sys.sleep(1.5)
}
df_all_decisions <- do.call("rbind", list_all) #bind them all together in a data frame from the list.
```
This should take around 7176*1.5/60 minutes, depending on how much sys.sleep you choose.
## Data Cleaning
Clean the decisions. First drop the decisions with an "Urteil", because these are not Chamber decisions
```{r}
source("helper_functions.R")
load("~/Dropbox/Kaggle/Scraper_GFCC_Decisions/scraper-decisions-German-Federal-Constitutional-Court/df_all_decisions3")
#load the
chamber_decisions <- df_all_decisions[!grepl("Urteil", df_all_decisions$header), ]
#some of the remaining ones are a "Beschluss", but also not a chamber decisions. Because we know that sometimes there is a mistake in the citings, we make double sure that we do not drop the wrong decisions by checking whether the word "Kammer" is in the header of the decision and the suggested citation:
chamber_decisions$check_if_chamber1 <- grepl("Kammer", chamber_decisions$cite)
chamber_decisions$check_if_chamber2 <- grepl("Kammer", as.character(chamber_decisions$chamber_senate))
#not always true
table(chamber_decisions$check_if_chamber1 == chamber_decisions$check_if_chamber2)
#only keep the decisions which are chamber decisions. Keep if one of the both is true or all is true. This is because sometimes, in the citation there is no chamber named, but also sometimes not in the chamber senate header. If one is true, it is true in general.
chamber_decisions <- chamber_decisions[chamber_decisions$check_if_chamber1 == TRUE & chamber_decisions$check_if_chamber2 == FALSE |
chamber_decisions$check_if_chamber1 == FALSE & chamber_decisions$check_if_chamber2 == TRUE |
chamber_decisions$check_if_chamber1 == TRUE & chamber_decisions$check_if_chamber2 == TRUE, ]
#clean the aktenzeichen
#There are 12 decisions where there is no az in the header. Thus, ifelse is used to replace these with NA such that the
#length matches the nrow of the data.
chamber_decisions$az <- str_extract(chamber_decisions$header, "\\d \\w\\w\\w \\d+...") %>% unlist
# chamber_decisions$az <- ifelse(is.na(str_extract(chamber_decisions$header, "\\d \\w\\w\\w \\d+...") %>% unlist), NA, str_extract_all(chamber_decisions$header, "\\d \\w\\w\\w \\d+...") %>% unlist)
chamber_decisions <- chamber_decisions %>% select(az, everything())
#create date
for(i in 1:nrow(chamber_decisions)){
tmp <- strsplit(as.character(chamber_decisions$header[i]), " ") %>% unlist()
chamber_decisions$date[i] <- str_c(tmp[3:5], collapse = " ")
}
chamber_decisions$date <- german_date_to_date(chamber_decisions$date)
#extract the senate the chamber decision was decided in
chamber_decisions$senat <- str_extract(chamber_decisions$az, "[0-9]+")
chamber_decisions$senat_check <- ifelse(grepl("Ersten Senats", as.character(chamber_decisions$chamber_senate)), 1, NA)
chamber_decisions$senat_check[grepl("Zweiten Senats", as.character(chamber_decisions$chamber_senate))] <- 2
table(chamber_decisions$senat_check, useNA = "ifany")
#There are some mistakes done by the court:
table(chamber_decisions$senat_check == chamber_decisions$senat)
#This is the wrong decision:
#https://www.bundesverfassungsgericht.de/e/rk20090224_1bvr018909.html
#correct this manually:
chamber_decisions$senat[chamber_decisions$az == "2 BvR 189/09"] <- "1"
#check again:
table(chamber_decisions$senat_check == chamber_decisions$senat)
#extract the chambers
chamber_decisions$kammer_cit <- str_extract(as.character(chamber_decisions$cite), "[0-9]+")
chamber_decisions$kammer_txt <- str_extract(as.character(chamber_decisions$chamber_senate), "[0-9]+")
#check:
table(chamber_decisions$kammer_cit == chamber_decisions$kammer_txt) #there are rare cases where the court makes an error when writing the citation. We will keep track of this later. An example is for instance this decision: http:https://www.bverfg.de/e/rk20130630_2bvr008513.html
chamber_decisions$is_same_chamber <- chamber_decisions$kammer_cit == chamber_decisions$kammer_txt
#kammer_txt is always the right one.
```
There are some missings in the data set we have to take care off.
```{r}
####take care of NAs
cat("Before cleaning there are", sum(is.na(chamber_decisions)), "NAs in the data set")
#fill NAs of kammer_txt (with kammer of citation)
chamber_decisions$kammer_txt[is.na(chamber_decisions$kammer_txt)] <- chamber_decisions$kammer_cit[is.na(chamber_decisions$kammer_txt)]
#some NAs in the AZ:
chamber_decisions$link <- as.character(chamber_decisions$link)
chamber_decisions$az <- as.character(chamber_decisions$az)
#We extract them from the link:
chamber_decisions$az[is.na(chamber_decisions$az)] <- str_extract(chamber_decisions$link[is.na(chamber_decisions$az)], "\\dbvr\\d+")
#take care of NAs in Senat. Senat is always the first number of the AZ:
chamber_decisions$senat[is.na(chamber_decisions$senat)] <- str_first_number(chamber_decisions$az[is.na(chamber_decisions$senat)])
#check:
sapply(chamber_decisions, function(y) sum(length(which(is.na(y)))))
```
Now we have to take care of the names. This is complex and requires some manual editing.
```{r}
#extracting the judges names
for(i in 1:nrow(chamber_decisions)){
chamber_decisions$richter_txt[i] <- c(names_extract(chamber_decisions$judge1[i]), names_extract(chamber_decisions$judge2[i]),
names_extract(chamber_decisions$judge3[i])) %>%
str_c(collapse = " ") %>%
gsub("\\s", ",", .) #to bring it in the form for later analyses
}
#There are some NAs we have to take care off:
chamber_decisions$judges_names <- as.character(chamber_decisions$judges_names)
sum(is.na(chamber_decisions$richter_txt))
#these are the problematic cases: very often, 2 judges are written in j1 j2 and j3
chamber_decisions$judges_names[is.na(chamber_decisions$richter_txt)]
#replace the NAs with the judge 1 and judge 2 names and bind strings together:
# chamber_decisions$richter_txt[is.na(chamber_decisions$richter_txt)] <- str_c(names_extract(as.character(chamber_decisions$judge1[is.na(chamber_decisions$richter_txt)])),
# names_extract(as.character(chamber_decisions$judge2[is.na(chamber_decisions$richter_txt)])))
#check the levels:
unique(levels(as.factor(chamber_decisions$richter_txt)))
#apply second cleaning:
chamber_decisions$richter_txt <- clean_names2(chamber_decisions$richter_txt)
#check the levels again:
unique(levels(as.factor(chamber_decisions$richter_txt)))
####some more checks:
# consolidate richter (from cite)
richter <- clean.names(chamber_decisions$richter_txt)
#richter[richter=="No names in the text"] <- clean.names(case$richter_cit)[richter=="No names in the text"]
richter[richter=="None"] <- NA
# no of richter
nrichter <- unlist(lapply(richter.split(richter), length))
table(nrichter)
chamber_decisions$nrichter <- nrichter
#write all problematic cases into vector to check them
test1 <- chamber_decisions$judges_names[chamber_decisions$nrichter != 3] %>% stri_na2empty() %>% stri_remove_empty()
list_all_names <- c("Baer", "Britz", "Broß", "Bryde",
"Eichberger", "Gerhardt", "Christ", "DiFabio", "Gaier", "Graßhof",
"Grimm", "Haas","Hermanns", "Harbarth", "Hassemer", "Hoffmann-Riem", "Hohmann-Dennhardt",
"Hömig", "Huber", "Jaeger", "Jentsch", "Kessal-Wulf", "Kirchhof", "König", "Kruis",
"Kühling", "Landau", "Langenfeld", "Limbach", "Lübbe-Wolff", "Maidowski", "Masing",
"Mellinghoff", "Müller", "Osterloh", "Ott", "Papier", "Paulus", "Radtke", "Schluckebier",
"Seidl", "Sommer", "Steiner", "Voßkuhle", "Winter")
#set the Schreibweise to all names the same:
list_all_names <- clean.names(list_all_names)
test1 <- clean.names(test1)
#check whether all unique names (important for matching later):
length(list_all_names) == length(unique(list_all_names))
correct_names <- list()
for(i in 1:length(test1)){
correct_names[[i]] <- str_c(str_extract_all(test1[i], list_all_names) %>% unlist(), collapse = ",")
}
#The last two are empty (Nothing found), so we have to set them NA to bind the vector the original data frame
correct_names[[61]] <- NA
correct_names[[62]] <- NA
#write into data set:
chamber_decisions$richter_txt[chamber_decisions$nrichter != 3] <- unlist(correct_names)
#now check njudges again:
richter <- clean.names(chamber_decisions$richter_txt)
# no of richter
nrichter <- unlist(lapply(richter.split(richter), length))
chamber_decisions$nrichter2 <- nrichter
table(nrichter)
```
There are 3 names which are still wrong (less than 3 judges in a panel). I will add these manually:
No header:
https://www.bundesverfassungsgericht.de/e/rk19990413_2bvr056299.html
```{r}
chamber_decisions$richter_txt[chamber_decisions$nrichter2 != 3] <- c("Hohmann-Dennhardt,Gaier,Kirchhof", "Limbach,Winter,Hassemer", "Limbach,Winter,Hassemer")
#check:
check_names <- str_split(chamber_decisions$richter_txt,",") %>% unlist()
unique(check_names)
#Still trailing whitespace:
#we have to remove them manually; strim or gsub \\s does not work.
check_names <- str_split(chamber_decisions$richter_txt,",") %>% unlist()
chamber_decisions$richter_txt <- gsub(unique(check_names)[33], "Lübbe-Wolff",
chamber_decisions$richter_txt)
chamber_decisions$richter_txt <- gsub(unique(check_names)[34], "Gerhardt",
chamber_decisions$richter_txt)
chamber_decisions$richter_txt <- gsub(unique(check_names)[40], "Hömig",
chamber_decisions$richter_txt)
chamber_decisions$richter_txt <- gsub(unique(check_names)[41], "Bryde",
chamber_decisions$richter_txt)
# chamber_decisions$richter_txt <- gsub(unique(check_names)[50], "Jaeger",
# chamber_decisions$richter_txt)
#
# chamber_decisions$richter_txt <- gsub(unique(check_names)[51], "Steiner",
# chamber_decisions$richter_txt)
#
# chamber_decisions$richter_txt <- gsub(unique(check_names)[53], "Kruis",
# chamber_decisions$richter_txt)
#
# chamber_decisions$richter_txt <- gsub(unique(check_names)[54], "Sommer",
# chamber_decisions$richter_txt)
#check again:
check_names <- str_split(chamber_decisions$richter_txt,",") %>% unlist()
unique(check_names)
#do we have all names that we manually identified?
length(list_all_names) == length(unique(check_names))
#Finally. Yesssss we have it.
```
```{r}
#some names are still wrong. This is because of inconsistencies on the page of the court: e.g. https://www.bundesverfassungsgericht.de/e/rk20071227_1bvr085306.html . Here, there is an unusual "\n" on the place where the names of the judges appear.
#https://www.bundesverfassungsgericht.de/e/qk20160730_1bvq002916.html repeats the names
#These are the cases where there are formatting errors on the page:
links_formatting_errors <- data.frame(
aktenzeichen = chamber_decisions$az[grepl("durch", chamber_decisions$richter_txt)],
header = chamber_decisions$header[grepl("durch", chamber_decisions$richter_txt)],
link = chamber_decisions$link_decision[grepl("durch", chamber_decisions$richter_txt)])
#write for GFCC:rfG, Urteil des Zweiten Sen
write.csv(links_formatting_errors, "links_formatting_errors.csv")
#######
```
Identify Einstweilige Anordnung (Provisional Orders):
```{r}
#Einstweilige Anordnung:
chamber_decisions$anordnung <- ifelse(grepl("BvQ", chamber_decisions$az), 1, 0)
```
Now we have all the information that we need for the chamber paper. We save the data set. Before, we drop irrelevant information.
```{r}
chamber_decisions <- chamber_decisions %>%
select(az, date, senat, kammer_txt, richter_txt, anordnung, link = link_decision)
cat("The data is clean: before writing the data set there are", sum(is.na(chamber_decisions)), "NAs in the data set")
save(chamber_decisions, file = "chamber_decision_cleaned.Rda")
write.csv(chamber_decisions, file = "chamber_decision_cleaned.csv", fileEncoding = "UTF-8")
```
You can of course easily extend this script by scraping additional information, such as how the overall outcome of the case was or who are the plaintiffs etc.