#!/bin/sh
# FIND FAKE GOOGLEBOTS AND BINGBOTS FROM NGINX SERVER LOG FILES
# Created by: Mitchell Krog ([email protected])
# Copyright: Mitchell Krog - https://github.com/mitchellkrogza
# Repo Url: https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker
##############################################################################
# _ __ _ #
# / |/ /__ _(_)__ __ __ #
# / / _ `/ / _ \\ \ / #
# /_/|_/\_, /_/_//_/_\_\ #
# __/___/ __ ___ __ ___ __ __ #
# / _ )___ ____/ / / _ )___ / /_ / _ )/ /__ ____/ /_____ ____ #
# / _ / _ `/ _ / / _ / _ \/ __/ / _ / / _ \/ __/ '_/ -_) __/ #
# /____/\_,_/\_,_/ /____/\___/\__/ /____/_/\___/\__/_/\_\\__/_/ #
# #
##############################################################################
# ------------------------------------------------------------------------------
# MIT License
# ------------------------------------------------------------------------------
# Copyright (c) 2017 Mitchell Krog - [email protected]
# https://github.com/mitchellkrogza
# ------------------------------------------------------------------------------
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# ------------------------------------------------------------------------------
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
# ------------------------------------------------------------------------------
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
# ------------------------------------------------------------------------------
# ------------------
# WHAT DOES THIS DO?
# ---------------------------------------------------------------------------------------------------------
# It extracts every log line that claims to be Googlebot / bingbot from all of your log files.
# This includes all valid Google and Bing bots too.
# These lines are extracted from your logs into new temporary log files.
# These files are then processed to find only the fake bots, i.e. IPs whose reverse DNS name does not
# contain "google" or "msn", and that list is then emailed to you.
# After the script has run and emailed you, all temporary files are cleaned up. Your original log files
# are not touched or modified in any way whatsoever.
#
# It is lightning fast !!!
# 33.946s from start to finish for a full month's worth of log files from 40 web sites.
#
# THIS SCRIPT WILL PROCESS ALL CURRENT LOG FILES IN YOUR NGINX LOG FILE LOCATION
# This means it ONLY processes this month's "current" log files, which are named xxxxxxxx-access.log
# It will NOT process rolled-over log files, i.e. xxxxxxxx-access.log.1 and xxxxxxxx-access.log.2
# It is pointless looking for fake bots in older logs anyway, as these guys change IPs frequently.
#
# This script does NOT touch or modify ANY of your real log files.
# ---------------------------------------------------------------------------------------------------------
# ----------------------
# REQUIREMENTS AND NOTES
# ----------------------
# - mutt (sending emails) - sudo apt install mutt
# - awk
# - nawk
# - sed
# - dig
# - USES: ANY existing Nginx log format that starts with '$remote_addr'
# ----------------------
# INSTALLATION AND USAGE
# ----------------------
# --------------------------------------------------------
# 1. STOP Mutt from storing all sent emails
# otherwise it creates an ever growing file called "sent"
# --------------------------------------------------------
#
# sudo nano /etc/Muttrc
#
# ---------------------------------------
# 2. PASTE this at the bottom of the file
# ---------------------------------------
#
# set copy = no
# set folder = ""
#
# ----------------------------------------------------------
# 3. SAVE this script in your HOME folder as findfakebots.sh
# ----------------------------------------------------------
#
# ------------------------------
# 4. MAKE this script executable
# ------------------------------
#
# sudo chmod +x findfakebots.sh
#
# -------------------------------------
# 5. EDIT the USER SETTINGS block below
# -------------------------------------
#
# ---------------------------
# 6. RUN the script with sudo
# ---------------------------
#
# cd ${HOME}
# sudo ./findfakebots.sh
#
# RUN FROM CRON as you like. Make sure you have allowed your user to run sudo from CRON through visudo !!!
# You should only need to run this about once a week.
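#
# For example, a weekly entry in your user's crontab (crontab -e) could look like the line below.
# This is only a sketch: the schedule (Sundays at 03:00) is an arbitrary assumption, and it relies
# on passwordless sudo for this script having been set up through visudo as noted above.
#
# 0 3 * * 0 cd "$HOME" && sudo ./findfakebots.sh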
#
# --------------------------
# 7. REPORTING YOUR FINDINGS
# --------------------------
# ----------------------------------------------------------------------------------------------------------------------------------------------------
# The email you receive contains a list of the IPs detected and, below that, a list of the same IPs with their reverse DNS names.
# Before you report them in this repo as issues, you need to first get the whois details of each one, and log ONLY ONE IP per issue.
#
# See example: https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/issues/293
#
# Your issue MUST include:
# - the whois output from https://www.ultratools.com/tools/ipWhoisLookupResult
# - An excerpt from your log file
# - DO NOT log issues for any IPs that resolve with 'dynamic' or 'adsl' in the reverse lookup. It is pointless blocking dynamic addresses.
# ----------------------------------------------------------------------------------------------------------------------------------------------------
# -------------
# USER SETTINGS
# -------------
recipient="" # < ADD your own email address between the ""
nginxlogslocation=/var/log/nginx # < Location of your nginx log directory
# -----------------
# END USER SETTINGS
# -----------------
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# ------------------------------------
# DONT MODIFY ANYTHING BELOW THIS LINE
# ------------------------------------
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# ---------
# VARIABLES
# ---------
datenow=$(date +%F)
timenow=$(date +%T)
# -----------------
# TEMP FILES WE USE
# -----------------
googlelog=${nginxlogslocation}/googlebots.log
googlefile=${HOME}/googlebots.list
googleemailfile=${HOME}/fakegooglebots.txt
googletestfile=${HOME}/googlebots.tested
googlefakefile=${HOME}/googlebots.fake
binglog=${nginxlogslocation}/bingbots.log
bingfile=${HOME}/bingbots.list
bingemailfile=${HOME}/fakebingbots.txt
bingtestfile=${HOME}/bingbots.tested
bingfakefile=${HOME}/bingbots.fake
tempfile=${HOME}/file.tmp
# -----------------------------
# PROCESS ALL CURRENT LOG FILES
# -----------------------------
cd "${nginxlogslocation}" || exit 1
# FIND ALL GOOGLEBOTS AND WRITE THEM TO A NEW LOG FILE
# (-name '*access.log' only matches current logs, so rolled-over files such as access.log.1 are skipped)
for logfile in $(find . -type f -name '*access.log'); do
grep 'compatible; Googlebot/' "${logfile}" >> "${googlelog}"
done
# FIND ALL BINGBOTS AND WRITE THEM TO A NEW LOG FILE
for logfile in $(find . -type f -name '*access.log'); do
grep 'compatible; bingbot/' "${logfile}" >> "${binglog}"
done
# -----------------------------
# FIND AND TEST FAKE GOOGLEBOTS
# -----------------------------
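# Prepare the latest GoogleBots file (the first field of each log line is $remote_addr)
nawk '{print $1}' "${googlelog}" > "${tempfile}" && mv "${tempfile}" "${googlefile}"
# Sort the file and remove duplicates
sort -u "${googlefile}" -o "${googlefile}"
# Look up the reverse DNS name of each IP (multiple PTR records are joined onto one line)
while read -r line
do
echo "$line - $(dig -x "$line" +short | tr '\n' ' ')"
done < "${googlefile}" > "${googletestfile}"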
# Print all Reverse DNS Results NOT Containing "Google" ie. Possible FAKE BOTS
awk '!/google/' ${googletestfile} > ${googlefakefile}
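# A stricter check than the substring match above could compare the END of the
# reverse DNS name against the domains Google documents for its crawlers.
# Minimal sketch only: is_google_crawler_host is a hypothetical helper that this
# script does not call, and the domain list is an assumption that should be
# checked against Google's current documentation.
is_google_crawler_host() {
    # $1 = reverse DNS name as printed by dig (with or without the trailing dot)
    case "${1%.}" in
        *.googlebot.com|*.google.com|*.googleusercontent.com) return 0 ;;
        *) return 1 ;;
    esac
}
# Possible usage: is_google_crawler_host "$name" || echo "$ip looks fake"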
# Prepare our email file
# Print the list of IPs only first
nawk '{print $1}' "${googlefakefile}" > "${tempfile}" && mv "${tempfile}" "${googleemailfile}"
# Sort the file
sort -u "${googleemailfile}" -o "${googleemailfile}"
# Add extra info to the email file: this section at the bottom includes the reverse DNS names we looked up
printf '\n-----------------------------------\nIP ADDRESSES WITH REVERSE DNS NAMES\n-----------------------------------\n\n' >> ${googleemailfile}
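# Split on " - " rather than "-" so that reverse DNS names containing hyphens are printed in full
awk -F " - " '{print $1, $2}' "${googlefakefile}" >> "${googleemailfile}"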
# Print Message Date and Time at Top of Email
sed -i "1s/^/Possible Fake GoogleBots Detected\n$datenow - $timenow\n---------------------------------\n\n---------------------------------\nIP ADDRESSES FOUND\n---------------------------------\n/" ${googleemailfile}
# If no fake bots were found we do not send an email, otherwise we send the email.
# We test the list of fakes here because the email file always contains the headings added above.
if [ -s "$googlefakefile" ]
then
# Email me the file
echo "Fake GoogleBots" | mutt -s "Fake GoogleBots" -a "${googleemailfile}" -- "${recipient}"
else
:
fi
# ---------------------------------
# END FIND AND TEST FAKE GOOGLEBOTS
# ---------------------------------
# ---------------------------
# FIND AND TEST FAKE BINGBOTS
# ---------------------------
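# Prepare the latest BingBots file (the first field of each log line is $remote_addr)
nawk '{print $1}' "${binglog}" > "${tempfile}" && mv "${tempfile}" "${bingfile}"
# Sort the file and remove duplicates
sort -u "${bingfile}" -o "${bingfile}"
# Look up the reverse DNS name of each IP (multiple PTR records are joined onto one line)
while read -r line
do
echo "$line - $(dig -x "$line" +short | tr '\n' ' ')"
done < "${bingfile}" > "${bingtestfile}"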
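# The lookups above (for both Google and Bing) only go one way, from IP to name,
# and whoever controls an IP range also controls its reverse DNS name. A common
# hardening step is a forward confirmation: resolve the returned name again and
# check that it leads back to the same IP.
# Minimal sketch only: forward_confirms is a hypothetical helper that this
# script does not call.
forward_confirms() {
    # $1 = client IP from the log, $2 = reverse DNS name found for that IP
    [ -n "$2" ] || return 1
    # Succeeds only if one of the addresses the name resolves to equals the IP
    dig +short "$2" | grep -qxF -- "$1"
}
# Possible usage: forward_confirms "$ip" "$name" || echo "$ip failed forward confirmation"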
# Print all Reverse DNS Results NOT Containing "msn" ie. Possible FAKE BOTS
awk '!/msn/' ${bingtestfile} > ${bingfakefile}
# Prepare our email file
# Print the list of IPs only first
nawk '{print $1}' "${bingfakefile}" > "${tempfile}" && mv "${tempfile}" "${bingemailfile}"
# Sort the file
sort -u "${bingemailfile}" -o "${bingemailfile}"
# Add extra info to the email file: this section at the bottom includes the reverse DNS names we looked up
printf '\n-----------------------------------\nIP ADDRESSES WITH REVERSE DNS NAMES\n-----------------------------------\n\n' >> ${bingemailfile}
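# Split on " - " rather than "-" so that reverse DNS names containing hyphens are printed in full
awk -F " - " '{print $1, $2}' "${bingfakefile}" >> "${bingemailfile}"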
# Print Message Date and Time at Top of Email
sed -i "1s/^/Possible Fake BingBots Detected\n$datenow - $timenow\n---------------------------------\n\n---------------------------------\nIP ADDRESSES FOUND\n---------------------------------\n/" ${bingemailfile}
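# If no fake bots were found we do not send an email, otherwise we send the email.
# We test the list of fakes here because the email file always contains the headings added above.
if [ -s "$bingfakefile" ]
then
# Email me the file
echo "Fake BingBots" | mutt -s "Fake Bing Bots" -a "${bingemailfile}" -- "${recipient}"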
else
:
fi
# -------------------------------
# END FIND AND TEST FAKE BINGBOTS
# -------------------------------
# ---------------------------------------------------
# DELETE ALL TEMP FILES
# ---------------------------------------------------
# This does NOT delete any of your real server logs
# only the temp log files created by this script
# ---------------------------------------------------
sudo rm ${googlelog}
sudo rm ${binglog}
sudo rm ${googlefile}
sudo rm ${googleemailfile}
sudo rm ${googletestfile}
sudo rm ${googlefakefile}
sudo rm ${bingfile}
sudo rm ${bingemailfile}
sudo rm ${bingtestfile}
sudo rm ${bingfakefile}
# ----------------------
# EXIT WITH ERROR NUMBER
# ----------------------
exit ${?}