I would like to get the timetable from my school's website and use it in a script to set automatic alerts, but I don't know how.
It seems my school uses FullCalendar to render the timetable, so the times aren't plain HTML tags in the .html file.
As we don't have the real website you want to scrape, and scraping is always site-specific unless there is a standardized API, it's not possible to give a 100% working solution. But I'll try to explain a way to get to your information.
fullcalendar.io is JavaScript based; the events are either set up as a JavaScript object or imported from JSON. If the latter is the case, you can simply download the ready-made JSON file that is referenced somewhere in the JavaScript source code. Regarding parsing JSON, there are many questions and answers around here.
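If your page does load its events from such a JSON feed, a one-liner may already be enough. This is only a sketch: the URL and the start/title field names are made up for illustration, so adjust them to whatever your page actually references.

# Hypothetical example: download a JSON event feed and print start time + title.
curl -s 'https://school.example.com/calendar/events.json' \
  | jq -r '.[] | [.start, .title] | @tsv'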
If it's set up as a JavaScript object instead, you can parse the .js file, or, if it's included in an HTML <script> tag, parse the HTML for the $('#calendar').fullCalendar( call.
We can use curl to get the website and then extract the information using e.g. awk.
I made a small script to get the object for the fullcalendar.io Basic Views demo. Your script may look similar.
curl -s https://fullcalendar.io/releases/fullcalendar/3.9.0/demos/basic-views.html \
| awk '/\.fullCalendar\(\{/{s=1; print "{"; next;};
/\}\)\;/{s=0};
s{print};
END{print "}";}'
Explanation:
/\.fullCalendar\(\{/{s=1; print "{"; next;};
searches for .fullCalendar({ and, if found, sets the variable s=1 and prints {.
/\}\)\;/{s=0};
searches for }); and sets the variable s=0.
s{print};
prints the line if s is set and not 0.
END{print "}";}
prints the closing } at the end.
Output:
{
    header: {
        left: 'prev,next today',
        center: 'title',
        right: 'month,basicWeek,basicDay'
    },
    defaultDate: '2018-03-12',
    navLinks: true, // can click day/week names to navigate views
    editable: true,
    eventLimit: true, // allow "more" link when too many events
    events: [
        {
            title: 'All Day Event',
            start: '2018-03-01'
        },
        {
            title: 'Long Event',
            start: '2018-03-07',
            end: '2018-03-10'
        },
        {
            id: 999,
            title: 'Repeating Event',
            start: '2018-03-09T16:00:00'
        },
        {
            id: 999,
            title: 'Repeating Event',
            start: '2018-03-16T16:00:00'
        },
        {
            title: 'Conference',
            start: '2018-03-11',
            end: '2018-03-13'
        },
        {
            title: 'Meeting',
            start: '2018-03-12T10:30:00',
            end: '2018-03-12T12:30:00'
        },
        {
            title: 'Lunch',
            start: '2018-03-12T12:00:00'
        },
        {
            title: 'Meeting',
            start: '2018-03-12T14:30:00'
        },
        {
            title: 'Happy Hour',
            start: '2018-03-12T17:30:00'
        },
        {
            title: 'Dinner',
            start: '2018-03-12T20:00:00'
        },
        {
            title: 'Birthday Party',
            start: '2018-03-13T07:00:00'
        },
        {
            title: 'Click for Google',
            url: 'http://google.com/',
            start: '2018-03-28'
        }
    ]
}
You can then convert the JS object into proper JSON using Python and demjson.
Install demjson:
pip3 install demjson
and then run this:
curl -s https://fullcalendar.io/releases/fullcalendar/3.9.0/demos/basic-views.html \
| awk '/\.fullCalendar\(\{/{s=1; print "{"; next;};
/\}\)\;/{s=0};
s{print};
END{print "}";}' \
| python3 -c "import demjson, sys, json; print(json.dumps(demjson.decode('\n'.join(sys.stdin.readlines()))));" \
| jq ".events"
From here it should be fairly easy to move on using jq. Of course, instead of bash and jq you can do the whole thing in Python.
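For instance, assuming you redirect the output of the pipeline above into a file called events.json (the file name is my own choice, not part of the page), you could list each event's start time and title, ready to be fed into cron, at, or notify-send for the alerts you mentioned:

# List start time and title for every extracted event.
jq -r '.[] | [.start, .title] | @tsv' events.json

# Or only the events that start today (comparing the date prefix of the ISO timestamp).
today=$(date +%F)
jq -r --arg d "$today" '.[] | select(.start | startswith($d)) | [.start, .title] | @tsv' events.json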
jq should be the preferred tool to parse JSON. Don't use awk for that ... It will only cause pain ...
– pLumo Dec 13 '18 at 14:30
The websync bash script uses wget to retrieve answers here in Ask Ubuntu. It searches HTML tags to find Question Upvotes and Answer Upvotes. It converts special HTML symbols such as &amp; to & and &lt; to <, etc.
Here are a few snippets from the code you may find helpful:
LineOut=""                          # Global variable holding the converted line

HTMLtoText () {

    LineOut=$1                      # Parm 1 = Input line
    # Convert common HTML entities back to plain characters
    LineOut="${LineOut//&amp;/&}"
    LineOut="${LineOut//&lt;/<}"
    LineOut="${LineOut//&gt;/>}"
    LineOut="${LineOut//&quot;/'"'}"
    LineOut="${LineOut//&#39;/"'"}"
    LineOut="${LineOut//&ldquo;/'"'}"
    LineOut="${LineOut//&rdquo;/'"'}"

} # HTMLtoText ()
Ampersand=$'\046'                   # Octal 046 = "&"
(... SNIP LINES ...)
while IFS= read -r Line; do
(... SNIP LINES ...)
# Convert HTML codes to normal characters
HTMLtoText "$Line"
Line="$LineOut"
(... SNIP LINES ...)
done < "/tmp/$AnswerID"
(... SNIP LINES ...)
wget -O- "${RecArr[$ColWebAddr]}" > "/tmp/$AnswerID"
if [[ "$?" -ne 0 ]] # check return code for errors
then
# Sometimes a second attempt is required. Not sure why.
wget -O- "${RecArr[$ColWebAddr]}" > "/tmp/$AnswerID"
fi
if [[ "$?" == 0 ]] # check return code for errors
then
echo "$BarNo:100" > "$PercentFile"
echo "$BarNo:#Download completed." > "$PercentFile"
else
echo "$BarNo:100" > "$PercentFile"
echo "$BarNo:#Download error." > "$PercentFile"
echo "ERROR: $AnswerID" >> ~/websync.log
return 1
fi