- <?php
- /**
- * <https://y.st./>
- * Copyright © 2015 Alex Yst <mailto:copyright@y.st>
- *
- * This program is free software: you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation, either version 3 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program. If not, see <https://www.gnu.org./licenses/>.
- **/
- $xhtml = array(
- 'title' => 'Fixing my spider',
- 'body' => <<<END
- <p>
- This morning, the code at <a href="https://stackoverflow.com/questions/17641073/how-to-set-a-maximum-size-limit-to-php-curl-downloads">Stack Overflow</a> looked much more manageable than last night.
- I built a new class based on it that takes a size in bytes during instantiation, then uses that size as a reference when the object is called as a function.
- The example instead used a closure, but I think having a single line such as <code>CURLOPT_PROGRESSFUNCTION => new curl_limit(1024*1024),</code> is much more readable.
- I opted not to set <code>CURLOPT_BUFFERSIZE</code>, as I do not know what that is supposed to do.
- I think that it somehow makes $a[cURL] progress reports come more frequently, from what the question/answer page was saying, but I do not need that.
- I do not need an exact cut-off point, just a way to keep a download from going completely wild.
- I set the download limit to a full megabyte, hoping that it would be high enough to let all regular Web pages come through, and so far, that seems to be fine.
- No downloads were aborted aside from that singular problem file.
- After getting past that though, the spider quickly ran out of pages to crawl.
- It seems that this website does not link to any onion-based websites that link to many others.
- I will try linking to <a href="http://skunksworkedp2cg.onion/">Harry71's Onion Spider robot</a> to improve the results I get.
- I fear that this much input could jam something up on my end though.
- My own spider is not at all optimized and it keeps all its known onion addresses in memory at once.
- This entry will be put up prematurely so that I can continue my experiments.
- </p>
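- <p>
- For reference, here is a minimal sketch of what such an invokable limiter might look like, assuming the five-argument progress callback signature that $a[PHP] has used since version 5.5; the property and parameter names are only illustrative:
- </p>
- <pre><code>class curl_limit {
-     private \$limit;
-     public function __construct(\$limit) {
-         \$this->limit = \$limit;
-     }
-     // cURL hands us the transfer handle and the current byte counts;
-     // returning a non-zero value tells it to abort the transfer.
-     public function __invoke(\$handle, \$download_size, \$downloaded, \$upload_size, \$uploaded) {
-         return (\$downloaded > \$this->limit) ? 1 : 0;
-     }
- }</code></pre>
- <p>
- One thing to keep in mind is that <code>CURLOPT_NOPROGRESS</code> needs to be set to <code>false</code> as well, or $a[cURL] never calls the progress function in the first place.
- </p>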
- <p>
- My <a href="/a/canary.txt">canary</a> still sings the tune of freedom and transparency.
- </p>
- <p>
- After some more testing, I realized that I needed a couple new features.
- The first strips unneeded ordering information from the database by sorting it before saving, so the saved file no longer reflects the order in which $a[URI]s were found.
- The second is more necessary for basic functionality, though.
- I found that the spider was repeatedly requesting the same page on Harry71's website using different $a[URI] fragments.
- We do not need to request the same page several times, nor do we want $a[URI] fragments to be in our database.
- A quick search of the existing database showed that there were only two fragments already in the database from the first successful run.
- One was a legitimate anchor that I embedded in a page, but the other was an error in my weblog.
- It confused me at first that the bulk of my anchors were not being found by the spider, but I soon realized that it is because those anchors are on pages accessible over the clearnet.
- </p>
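- <p>
- Roughly speaking, the fragment handling comes down to something like the sketch below; the <code>strip_fragment()</code> helper is only a stand-in for wherever the spider actually normalizes its $a[URI]s before they are queued or stored:
- </p>
- <pre><code>// A fragment only names a position within a page that we would already be
- // fetching, so drop it before the URI is queued or stored.
- function strip_fragment(\$uri) {
-     \$hash = strpos(\$uri, '#');
-     return (\$hash === false) ? \$uri : substr(\$uri, 0, \$hash);
- }</code></pre>
- <p>
- The sorting feature is even simpler: assuming the database really is nothing more than an array of known $a[URI]s, a single <code>sort()</code> (or <code>ksort()</code>, if the $a[URI]s are the keys) before the array is serialized and written out is enough to keep the saved file independent of the order in which pages happened to be crawled.
- </p>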
- <p>
- With these two new features in place, I ran the spider once more, starting over from the single-entry database that I had begun with.
- Because that first-run database was created only today, starting over seemed easier than trying to hand-edit the database to remove the two entries containing fragments.
- The database is currently stored as a serialized array, and the last time I tried editing one of those by hand, I kept breaking it.
- After the spider had run for a while, I noticed that it was converting some relative $a[URI]s to incorrect absolute $a[URI]s.
- It seems that I need to account for the special case of files being present in the website's document root while also linking to relative $a[URI]s.
- I took care of that error, and while I was at it, added a configurable user agent string.
- On the next run, quite a ways in, I found that the <code><base/></code> tag was not being properly handled, and once again, I had to start the spider over.
- </p>
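- <p>
- The troublesome case boils down to resolving a relative reference against the $a[URI] of the page it was found on. Below is a rough sketch of that resolution, ignoring <code>../</code> segments, query strings, and port numbers; the function name is only illustrative:
- </p>
- <pre><code>// Turn a relative reference into an absolute URI, given the URI of the
- // page that the reference appeared on.
- function resolve_uri(\$base, \$relative) {
-     if (parse_url(\$relative, PHP_URL_SCHEME) !== null) {
-         return \$relative; // Already absolute.
-     }
-     \$parts = parse_url(\$base);
-     \$prefix = \$parts['scheme'].'://'.\$parts['host'];
-     if (substr(\$relative, 0, 1) === '/') {
-         return \$prefix.\$relative; // Relative to the document root.
-     }
-     // dirname('/') and dirname('/file') both give '/', so pages that sit
-     // directly in the document root fall through the same path as any other.
-     \$directory = isset(\$parts['path']) ? dirname(\$parts['path']) : '/';
-     if (substr(\$directory, -1) !== '/') {
-         \$directory .= '/';
-     }
-     return \$prefix.\$directory.\$relative;
- }</code></pre>
- <p>
- When a page declares a <code><base/></code> tag, the tag's <code>href</code> value simply takes the place of the page's own $a[URI] as the <code>\$base</code> argument.
- </p>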
- <p>
- While speaking with <a href="http://zdasgqu3geo7i7yj.onion/">theunknownman</a> on <a href="ircs://volatile.ch:6697/">Volatile</a>, he asked how my website was put together, so I separated the personal material that I do not want in a clearnet repository out into its own repository, bound the two repositories together with symbolic links, and uploaded the <a href="https://notabug.org/y.st./authorednansyxlu.onion.">main compile scripts and templates</a>.
- </p>
- <p>
- On a more serious note, <a href="https://wowana.me/">wowaname</a> and lucy are hassling theunknownman on <a href="ircs://irc.volatile.ch:6697/%23Volatile">#Volatile</a>.
- Theunknownman had some sort of technical issue with his $a[VPN], and rather than the failure preventing an $a[IRC] connection from being established, his machine connected to the network over the clearnet.
- Now they are flaunting the fact that they have his home $a[IP] address, much to his terror.
- He thinks that they will actually do something to him now that they know where he is.
- I do not think that theunknownman is in any real danger, but this shows just how much of a troll wowaname can be.
- She has most of the channel hassling theunknownman simply because she can.
- </p>
- <p>
- I am hanging out with a band of trolls.
- I need to find better company to keep.
- It is difficult though when most places maliciously discriminate against $a[Tor] users.
- It seems that trolls are pushed into the few places that allow $a[Tor] use, as they use $a[Tor] to evade bans.
- Those of us who do not evade bans, and in fact have done nothing to get banned, get blocked as collateral damage.
- </p>
- <p>
- Yesterday, the letter saying that mail bearing my surname would be forwarded to our new address finally came.
- Today, we actually received our forwarded mail too, complete with spam.
- It seems that the mail forwarding has been set up successfully.
- </p>
- <p>
- My end-of-day progress with the spider did not go as planned.
- It got stuck on <code>costeirahx33fpqu.onion</code> for some reason.
- I waited several hours, but it would not budge.
- I will need to tinker with it more tomorrow to try getting past that issue.
- I might set a time-based timeout to take care of it, as I do not think that this was a large file issue like last time.
- </p>
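- <p>
- If I do go the timeout route, it should not take much; something like the following pair of options on the spider's $a[cURL] handle (called <code>\$handle</code> here only for illustration), with the numbers being nothing more than placeholders, ought to keep a single stubborn host from stalling the whole crawl:
- </p>
- <pre><code>curl_setopt_array(\$handle, array(
-     CURLOPT_CONNECTTIMEOUT => 60, // Give up if no connection can be made within a minute.
-     CURLOPT_TIMEOUT => 300,       // Abort any single transfer that runs longer than five minutes.
- ));</code></pre>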
- <p>
- I learned something interesting from synapt of <a href="ircs://irc.oftc.net:6697/%23php">#php</a> today.
- Apparently, the architects behind $a[PHP] were not even halfway done writing $a[PHP]6, yet people were already writing documentation and even books about how to code in it.
- This documentation and these books were obviously inaccurate, as there was no way to know yet how $a[PHP]6 would turn out, so $a[PHP]6 was canceled altogether to avoid the confusion that these people had caused.
- Now, the developers are instead working on $a[PHP]7, giving it the features that $a[PHP]6 was going to have.
- </p>
- END
- );