<?php
/**
* <https://y.st./>
* Copyright © 2016 Alex Yst <mailto:copyright@y.st>
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program. If not, see <https://www.gnu.org./licenses/>.
**/
$xhtml = array(
'title' => 'Moving forward again',
'body' => <<<END
<p>
Yesterday, I got almost nothing done, programming-wise.
I think the problem is that I sat down to work on boring wrapper classes and only boring wrapper classes.
I wanted to make some progress on that front so that I could be done with it sooner, but I need to throw in some actually usable stuff in between.
I should focus more on my spider and put only the minimal effort that I promised into the wrapper classes.
The wrapper classes will not be needed for quite some time, and if any of them become relevant before they are written, I can simply jump ahead and write the needed wrapper class.
This should keep me more motivated to actually keep my code moving forward.
</p>
<p>
As the first order of business today, I fixed the issue in which the spider would not recrawl the page that it was crawling when interrupted, though that fix will not help until the next time that it is interrupted or reaches the end of the current crawl.
I forgot to mention yesterday that the spider keeps getting interrupted every time that my laptop disconnects from the server, even when I use the <code>disown</code> command or <code>&amp;</code> operator, so I was able to deploy the fix fairly quickly.
The spider takes longer to start up as it gathers more information though, and I always worry that I have introduced bugs, such as ones that might prevent any output at all.
At one point, I caused the spider to loop endlessly, so I added a few debug output lines so that I can see where the script is looping.
It is a relief to be able to see the difference between a hang in the script and an unending loop with no output.
The spider hung for hours this time when I started it though, and I ended up adding even more debug lines.
As it turns out, my database has gotten too big for the spider to query effectively, and thus, the spider cannot know where to crawl next.
I thought that I would be fine if I simply did not hang onto information needed for search engine queries, but it seems that even holding onto data used by the spider itself is too much.
I have had to downsize what the spider stores, making it a bit less effective.
Now, it will only store information about website index pages.
It will have to depend on websites having internal links that make everything accessible from the main index, though it can take several hops if needed.
If a found $a[URI] points to a page on the same site, that page will be crawled in search of more links to external sites, but if a found $a[URI] points to a different website&apos;s page, the $a[URI] will be truncated to the root directory and stored in the database.
</p>
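<p>
As a rough sketch of that truncation step, assuming the found $a[URI] has already been resolved to an absolute form (the function name here is hypothetical, not the spider&apos;s actual code), <code>parse_url()</code> can do most of the work:
</p>
<pre>
&lt;?php
// Hypothetical sketch: reduce a cross-site \$uri to its root directory.
function truncate_to_root(\$uri)
{
    \$parts = parse_url(\$uri);
    if(\$parts === false or !isset(\$parts['scheme'], \$parts['host'])) {
        return false; // Not an absolute URI; nothing sensible to store.
    }
    \$root = \$parts['scheme'].'://'.\$parts['host'];
    if(isset(\$parts['port'])) {
        \$root .= ':'.\$parts['port'];
    }
    return \$root.'/';
}

// Yields "https://example.com./"; path, query, and fragment are dropped.
var_dump(truncate_to_root('https://example.com./deep/page.xhtml?q=1#top'));
</pre>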
<p>
I felt completely lost as to whether I need to rework the way that the crawling and database work first or implement support for other protocols first.
In the end, I decided to learn more about the protocols so that I can parse the responses that I get from the servers.
After I have built the tools that I need in order to understand server responses, I will set up the <code>switch()</code> statement for dealing with $a[URI]s of different schemes differently, rework the $a[HTTPS]/$a[HTTP]-handling code, and add support for the other schemes that $a[cURL] can handle.
</p>
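<p>
That <code>switch()</code> statement does not exist yet, but the shape that I have in mind is roughly the following; the handler function names here are only placeholders:
</p>
<pre>
&lt;?php
// Rough sketch of the planned scheme dispatch; handlers are placeholders.
\$scheme = strtolower((string)parse_url(\$uri, PHP_URL_SCHEME));
switch(\$scheme) {
    case 'https':
    case 'http':
        crawl_http(\$uri); // Existing code, due for a rework.
        break;
    case 'gopher':
        crawl_gopher(\$uri); // New: parse directory listings.
        break;
    default:
        // Other cURL-supported schemes can be slotted in later.
        break;
}
</pre>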
<p>
I started out by learning <a href="https://en.wikipedia.org/wiki/Gopher_%28protocol%29">Gopher</a> directory-listing syntax.
With some help from someone who prefers not to be mentioned, as well as some queries to a known Gopher server, I was able to decipher the very basic syntax.
As it turns out, Gopher directory listings are listings of resources on the same server, a different server, or a mix.
Gopher only allows linking to three types of servers though, assuming I understand correctly: Gopher servers, Telnet servers, and tn3270 servers, with tn3270 being an ancient form of Telnet that has its own $a[URI] scheme.
</p>
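<p>
The syntax itself is simple: each line of a listing is a one-character item type followed by a display string, a selector, a host, and a port, separated by tabs and ended with a carriage return and line feed, and a line holding a lone full stop marks the end of the listing.
A sketch of a parser for a single line, based on my reading of the format (the function name is mine, not part of any standard):
</p>
<pre>
&lt;?php
// Sketch: split one Gopher menu line into its tab-separated fields.
// Item types include "0" (text), "1" (menu), "8" (Telnet) and "T" (tn3270).
function parse_menu_line(\$line)
{
    \$line = rtrim(\$line, "\\r\\n");
    if(\$line === '.' or \$line === '') {
        return false; // A lone "." marks the end of the listing.
    }
    \$fields = explode("\\t", substr(\$line, 1));
    return array(
        'type' => \$line[0],
        'display' => isset(\$fields[0]) ? \$fields[0] : '',
        'selector' => isset(\$fields[1]) ? \$fields[1] : '',
        'host' => isset(\$fields[2]) ? \$fields[2] : '',
        'port' => isset(\$fields[3]) ? (int)\$fields[3] : 70, // 70 is the default Gopher port.
    );
}
</pre>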
<p>
There is a <a href="https://en.wikipedia.org/wiki/Gopher_%28protocol%29#URL_links">hacky workaround</a> that allows linking to arbitrary $a[URI]s, but it seems very messy to me.
Basically, it is implemented as a link to an $a[HTML] page that is on the same server as the directory listing, but the text of the link has a specific syntax.
The client is expected to recognize that because the specific syntax is being used, the link is not actually to be taken at face value.
The &quot;server&quot; and &quot;port&quot; information of the link is to be ignored, and information from the link text is to be used instead.
If the client fails to understand this, it will request the $a[HTML] file that the link actually points to; as a fallback, that $a[HTML] file will usually be an $a[HTML] redirection page that redirects the client to the correct $a[URI].
</p>
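<p>
If I ever do decide to recognize these, the linked description suggests that the magic string lives in the selector field, which begins with <code>URL:</code>; recognizing it would take little code (untested sketch, reusing the item array from the parser above):
</p>
<pre>
&lt;?php
// Untested sketch: a "URL link" hides the real target in its selector.
if(\$item['type'] === 'h' and strncmp(\$item['selector'], 'URL:', 4) === 0) {
    \$target = substr(\$item['selector'], 4); // e.g. "https://example.com./"
}
</pre>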
<p>
I do not like this special-case-type syntax, and will probably have the spider ignore it by simply not programming it to recognize the special case.
I assume that the $a[HTML] redirect pages in use rely on <code>&lt;meta/&gt;</code> refresh tags, so Gopher&apos;s use of those may prompt me to work on support for <code>&lt;meta/&gt;</code> refresh tags at some point soon.
This support would apply to $a[XHTML]/$a[HTML] pages retrieved over $a[HTTPS]/$a[HTTP] as well.
</p>
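<p>
When I get to it, extracting the target should be straightforward; here is a sketch using <code>DOMDocument</code>, assuming <code>\$html</code> holds the fetched page, where the fiddly part is the <code>content</code> attribute, which looks like <code>5; url=https://example.com./</code>:
</p>
<pre>
&lt;?php
// Sketch: pull the redirect target out of a meta refresh tag.
\$document = new DOMDocument();
@\$document-&gt;loadHTML(\$html); // Suppress warnings about sloppy markup.
foreach(\$document-&gt;getElementsByTagName('meta') as \$meta) {
    \$content = \$meta-&gt;getAttribute('content');
    if(strtolower(\$meta-&gt;getAttribute('http-equiv')) === 'refresh'
        and preg_match('/url\\s*=\\s*([^\\s;]+)/i', \$content, \$match)) {
        \$refresh_target = \$match[1];
        break;
    }
}
</pre>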
<p>
The other oddity in the Gopher directory listings is the informational message links.
These links have a dummy host name and a dummy port because they are not meant to be links at all.
However, as Gopher directory listings can contain only links, the only way to provide non-link information is by using such dummy information.
</p>
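<p>
For illustration, an informational line uses item type <code>i</code> and looks something like the line below, with <code>\\t</code> standing in for tab characters; the dummy host and port values seem to vary from server to server:
</p>
<pre>
iWelcome to my Gopher hole!\\t\\tfake\\t0
</pre>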
<p>
Next, I looked into $a[FTP] support.
$a[FTP] seems to return a directory listing of the files actually contained in a directory.
There is no title information present though, and as far as I know, $a[FTP] is not typically used to set up website-like pages the way that Gopher is, so I will not find $a[XHTML] files with links to other sites.
There is really no reason for me to crawl these pages.
I might put such services that do not need to be crawled in a separate database from the services that should be crawled.
I kind of wonder if I want the spider to crawl anything <strong>besides</strong> $a[HTTPS], $a[HTTP], and Gopher pages.
Other types of resources might be good to query for a name, such as $a[IRCS] servers, but I doubt that they will need to be recursively crawled.
</p>
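<p>
Fetching such a listing with $a[cURL] is simple enough, at least; a sketch, with a made-up host name:
</p>
<pre>
&lt;?php
// Sketch: request an FTP directory listing; the trailing "/" matters.
\$handle = curl_init('ftp://ftp.example.com./pub/');
curl_setopt(\$handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt(\$handle, CURLOPT_FTPLISTONLY, true); // Bare file names only.
\$listing = curl_exec(\$handle);
curl_close(\$handle);
</pre>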
<p>
It appears that <a href="https://opalrwf4mzmlfmag.onion/">wowaname</a>&apos;s hosting service has gone down for the time being.
She left her computer at her roommate&apos;s house over winter break due to not having a stable Internet connection at home, and <a href="http://answerstedhctbek.onion/">her roommate&apos;s older brother unplugged the machine</a>.
From the sound of it, wowaname&apos;s hosting is usually provided over her school&apos;s Internet connection.
They do not mind if students use this connection for whatever they like, such as providing hosting services, but they do not allow students to keep anything plugged in over the break.
</p>
<p>
Cyrus has less than two weeks before his time is up and it is too late for him to complete his Boy Scout project.
He was going to have me run some paperwork around town while he is in school tomorrow to save himself some time, but he opted against it, saying that he had more that needed to be done before the paperwork could be dealt with.
I wish that there were something I could do to help take the stress off of him a bit, but every time I ask, he says that there is nothing that I can do to help him yet.
</p>
<p>
My <a href="/a/canary.txt">canary</a> still sings the tune of freedom and transparency.
</p>
END
);