XRay
====

XRay parses structured content from a URL.

## Discovering Content

XRay will parse content in the following formats. First, the URL is checked against known services:

* GitHub
* XKCD
* Hacker News

If the content at the URL is XML or JSON, XRay will parse it as Atom, RSS or JSONFeed.

Finally, XRay looks for Microformats on the page and will determine the content from that.

* h-card
* h-entry
* h-event
* h-review
* h-recipe
* h-product
* h-item
* h-feed

## Library

XRay can be used as a library in your PHP project. The easiest way to install it and its dependencies is via Composer.

```
composer require p3k/xray
```

You can also [download a release](https://github.com/aaronpk/XRay/releases), which is a zip file with all dependencies already installed.

### Parsing

```php
$xray = new p3k\XRay();
$parsed = $xray->parse('https://aaronparecki.com/2017/04/28/9/');
```

If you already have an HTML or JSON document you want to parse, you can pass it as a string in the second parameter.

```php
$xray = new p3k\XRay();
$html = '<html>....</html>';
$parsed = $xray->parse('https://aaronparecki.com/2017/04/28/9/', $html);
```

```php
$xray = new p3k\XRay();
$jsonfeed = '{"version":"https://jsonfeed.org/version/1","title":"Manton Reece", ... ';
// Note that the JSON document must be passed in as a string in this case
$parsed = $xray->parse('https://manton.micro.blog/feed.json', $jsonfeed);
```

In both cases, you can pass an additional array parameter to configure how XRay behaves. The following options are supported:

* `timeout` - The timeout in seconds to wait for any HTTP requests
* `max_redirects` - The maximum number of redirects to follow
* `include_original` - Will also return the full document fetched
* `target` - Specify a target URL, and XRay will first check whether that URL is on the page, and will continue to parse the page only if it is. This is useful when you're using XRay to verify an incoming webmention.
* `expect=feed` - If you know the thing you are parsing is a feed, include this parameter, which will skip the autodetection rules and will provide better results for some feeds.
* `accept` - (options: `html`, `json`, `activitypub`, `xml`) - Without this parameter, XRay sends a default `Accept` header to prioritize getting the most likely best result from a page. If you are parsing a page for a specific purpose and expect to find only one type of content (e.g. webmentions will probably only be from HTML pages), you can include this parameter to adjust the `Accept` header XRay sends.

Additional parameters are supported when making requests that use the GitHub API. See the Authentication section below for details.

The XRay constructor can optionally be passed an array of default options, which will be applied in addition to (and can be overridden by) the options passed to individual `parse()` calls.

```php
$xray = new p3k\XRay([
  'timeout' => 30 // Time out all requests which take longer than 30s
]);

$parsed = $xray->parse('https://aaronparecki.com/2017/04/28/9/', [
  'timeout' => 40 // Override the default 30s timeout for this specific request
]);

$parsed = $xray->parse('https://aaronparecki.com/2017/04/28/9/', $html, [
  'target' => 'http://example.com/'
]);
```

The `$parsed` return value will look like the example below. See "Primary Data" below for an explanation of the vocabularies returned.

```
$parsed = Array
(
    [data] => Array
        (
            [type] => card
            [name] => Aaron Parecki
            [url] => https://aaronparecki.com/
            [photo] => https://aaronparecki.com/images/profile.jpg
        )

    [url] => https://aaronparecki.com/
    [code] => 200
    [source-format] => mf2+html
)
```

### Processing Microformats2 JSON

If you already have a parsed Microformats2 document as an array, you can use a special function to process it into XRay's native format. Make sure you pass the entire parsed document, not just the single item.

```php
$html = '<div class="h-entry"><p class="p-content p-name">Hello World</p><img src="/photo.jpg"></div>';
$mf2 = Mf2\parse($html, 'http://example.com/entry');

$xray = new p3k\XRay();
$parsed = $xray->process('http://example.com/entry', $mf2); // note the use of `process` not `parse`

/* $parsed now looks like:
Array
(
    [data] => Array
        (
            [type] => entry
            [post-type] => photo
            [photo] => Array
                (
                    [0] => http://example.com/photo.jpg
                )

            [content] => Array
                (
                    [text] => Hello World
                )

        )

    [url] => http://example.com/entry
    [source-format] => mf2+json
)
*/
```

### Rels

You can also use XRay to fetch all the rel values on a page. It merges the rel values from HTTP `Link` headers with the rel values found in the page's HTML.

```php
$xray = new p3k\XRay();
$rels = $xray->rels('https://aaronparecki.com/');
```

This will return a response similar to the parser's, but instead of a `data` key containing the parsed page, there will be a `rels` key containing an associative array. Each key is a rel value, and maps to an array of all the URLs that match that rel value.

```
Array
(
    [url] => https://aaronparecki.com/
    [code] => 200
    [rels] => Array
        (
            [hub] => Array
                (
                    [0] => https://switchboard.p3k.io/
                )

            [authorization_endpoint] => Array
                (
                    [0] => https://aaronparecki.com/auth
                )

            ...
```
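
Since every rel maps to an array, code that only needs one endpoint usually wants the first value or nothing. A minimal sketch, assuming a response shaped like the one above; `firstRel()` is a hypothetical helper, not part of XRay, and the sample array is hand-written rather than fetched:

```php
<?php
// Hypothetical helper (not part of XRay): return the first URL for a rel
// value, or null when the page did not advertise that rel at all.
function firstRel(array $response, string $rel): ?string
{
    $values = $response['rels'][$rel] ?? [];
    return $values[0] ?? null;
}

// Hand-written sample shaped like the output above.
$response = [
    'url'  => 'https://aaronparecki.com/',
    'code' => 200,
    'rels' => [
        'hub' => ['https://switchboard.p3k.io/'],
        'authorization_endpoint' => ['https://aaronparecki.com/auth'],
    ],
];

$hub = firstRel($response, 'hub');
$missing = firstRel($response, 'micropub'); // not in the sample, so null
```
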
### Feed Discovery

You can use XRay to discover the types of feeds available at a URL.

```php
$xray = new p3k\XRay();
$feeds = $xray->feeds('http://percolator.today');
```

This will fetch the URL, check for a Microformats feed, and check for `rel=alternate` links pointing to Atom, RSS or JSONFeed URLs. The response will look like the example below.

```
Array
(
    [url] => https://percolator.today/
    [code] => 200
    [feeds] => Array
        (
            [0] => Array
                (
                    [url] => https://percolator.today/
                    [type] => microformats
                )

            [1] => Array
                (
                    [url] => https://percolator.today/podcast.xml
                    [type] => rss
                )

        )

)
```
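
When a site offers several feeds, a consumer typically picks one by type. The sketch below assumes a response shaped like the one above; `pickFeed()` is a hypothetical helper, and only the two type strings that appear in the sample output (`microformats`, `rss`) are used:

```php
<?php
// Hypothetical helper (not part of XRay): return the first discovered feed
// whose type comes earliest in $preferredTypes, or null if none match.
function pickFeed(array $response, array $preferredTypes): ?array
{
    $feeds = $response['feeds'] ?? [];
    foreach ($preferredTypes as $type) {
        foreach ($feeds as $feed) {
            if (($feed['type'] ?? null) === $type) {
                return $feed;
            }
        }
    }
    return null;
}

// Hand-written sample shaped like the output above.
$response = [
    'url'   => 'https://percolator.today/',
    'code'  => 200,
    'feeds' => [
        ['url' => 'https://percolator.today/', 'type' => 'microformats'],
        ['url' => 'https://percolator.today/podcast.xml', 'type' => 'rss'],
    ],
];

// Prefer RSS, fall back to the Microformats feed.
$feed = pickFeed($response, ['rss', 'microformats']);
```
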
### Customizing the User Agent

Some websites require a user agent to be set. To set a custom user agent, set the `http` property of the object to a `p3k\HTTP` object.

```php
$xray = new p3k\XRay();
$xray->http = new p3k\HTTP('MyProject/1.0.0 (http://example.com/)');
$xray->parse('http://example.com/');
```

## API

XRay can also be used as an API to provide its parsing capabilities over an HTTP service.

To parse a page and return structured data for the contents of the page, pass a URL to the `/parse` route.

```
GET /parse?url=https://aaronparecki.com/2016/01/16/11/
```

To conditionally parse the page after first checking if it contains a link to a target URL, also include the target URL as a parameter. This is useful when using XRay to verify an incoming webmention.

```
GET /parse?url=https://aaronparecki.com/2016/01/16/11/&target=http://example.com
```

In both cases, the response will be a JSON object. On success, its `data` property contains a `type` key that refers to the kind of content found at the URL, most often `entry`. If there was an error, the response contains an `error` key instead (see "Error Response" below).

You can also make a POST request with the same parameter names.

If you already have an HTML or JSON document you want to parse, you can include it in the POST parameter `body`. This POST request would look like the following:

```
POST /parse
Content-type: application/x-www-form-urlencoded

url=https://aaronparecki.com/2016/01/16/11/
&body=<html>....</html>
```

or for GitHub where you might have JSON,

```
POST /parse
Content-type: application/x-www-form-urlencoded

url=https://github.com/aaronpk/XRay
&body={"repo":......}
```
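
The request bodies above appear to be shown unencoded for readability. An `application/x-www-form-urlencoded` body must be percent-encoded, otherwise characters such as `&` or `=` inside the document would be read as parameter separators. A minimal sketch using PHP's `http_build_query()`; `buildParseBody()` is a hypothetical helper name, and the HTML string is made up for illustration:

```php
<?php
// Hypothetical helper (not part of XRay): build the form-encoded body for
// a POST to /parse from a URL, a document string, and optional parameters.
function buildParseBody(string $url, string $document, array $options = []): string
{
    // http_build_query() percent-encodes every value, so "&", "=", "<" and
    // ">" inside the document cannot break the form body apart.
    return http_build_query(array_merge($options, [
        'url'  => $url,
        'body' => $document,
    ]));
}

$document = '<html><body>a & b = c</body></html>';
$body = buildParseBody('https://aaronparecki.com/2016/01/16/11/', $document, [
    'timeout' => 10,
]);
```

The resulting string can be sent as the request body with any HTTP client, together with the `Content-type: application/x-www-form-urlencoded` header.
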
### Parameters

XRay accepts the following parameters when calling `/parse`:

* `url` - The URL of the page to parse
* `target` - Specify a target URL, and XRay will first check whether that URL is on the page, and will continue to parse the page only if it is. This is useful when you're using XRay to verify an incoming webmention.
* `timeout` - The timeout in seconds to wait for any HTTP requests
* `max_redirects` - The maximum number of redirects to follow
* `include_original` - Will also return the full document fetched
* `expect=feed` - If you know the thing you are parsing is a feed, include this parameter, which will skip the autodetection rules and will provide better results for some feeds.

### Authentication

If the URL you are fetching requires authentication, include the access token in the parameter `token`, and it will be included in an `Authorization` header when fetching the URL. (It is recommended to use a POST request in this case, to avoid the access token potentially being logged as part of the query string.) This is useful for [Private Webmention](https://indieweb.org/Private-Webmention) verification.

```
POST /parse

url=https://aaronparecki.com/2016/01/16/11/
&target=http://example.com
&token=12341234123412341234
```

### API Authentication

XRay uses the GitHub API to fetch posts, and that API requires authentication. To keep XRay stateless, you must pass the credentials in with the parse call.

You should only send the credentials when the URL you are trying to parse is a GitHub URL, so you'll want to check whether the hostname is `github.com`, etc. before you include credentials in this call.

#### GitHub Authentication

XRay uses the GitHub API to fetch GitHub URLs, which provides higher rate limits when used with authentication. You can pass a GitHub access token along with the request and XRay will use it when making requests to the API.

* `github_access_token` - A GitHub access token

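
The hostname check described above can be sketched as follows. `parseParamsFor()` is a hypothetical helper, the token value is a placeholder, and this version matches only the exact host `github.com`; extend the comparison if you need to cover other GitHub hostnames.

```php
<?php
// Hypothetical helper (not part of XRay): build /parse parameters and attach
// the GitHub token only when the URL's host is exactly github.com.
function parseParamsFor(string $url, string $githubToken): array
{
    $params = ['url' => $url];

    // parse_url() returns null or false when there is no usable host;
    // casting to string turns both into '' so the comparison simply fails.
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));
    if ($host === 'github.com') {
        $params['github_access_token'] = $githubToken;
    }
    return $params;
}

$token = 'PLACEHOLDER_TOKEN'; // placeholder, not a real credential
$github = parseParamsFor('https://github.com/aaronpk/XRay', $token);
$other  = parseParamsFor('https://aaronparecki.com/2016/01/16/11/', $token);
```

Comparing the parsed host, rather than searching the URL string for `github.com`, avoids leaking the token to look-alike hosts such as `github.com.example.org`.
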
### Error Response

```json
{
  "error": "not_found",
  "error_description": "The URL provided was not found"
}
```

Possible errors are listed below:

* `not_found`: The URL provided was not found. (Returned 404 when fetching)
* `ssl_cert_error`: There was an error validating the SSL certificate. This may happen if the SSL certificate has expired.
* `ssl_unsupported_cipher`: The web server does not support any of the SSL ciphers known by the service.
* `timeout`: The service timed out trying to connect to the URL.
* `invalid_content`: The content at the URL was not valid. For example, providing a URL to an image will return this error.
* `no_link_found`: The target link was not found on the page. When a target parameter is provided, this is the error that will be returned if the target could not be found on the page.
* `no_content`: No usable content could be found at the given URL.
* `unauthorized`: The URL returned HTTP 401 Unauthorized.
* `forbidden`: The URL returned HTTP 403 Forbidden.

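
A client of the API will usually want to separate error responses from successful ones before touching the parsed data. A minimal sketch, assuming the raw JSON body is already in a string; `parsedDataOrFail()` is a hypothetical helper, and turning errors into exceptions is just one possible design:

```php
<?php
// Hypothetical helper (not part of XRay): decode a /parse response body and
// return its `data` array, or throw using the documented error fields.
function parsedDataOrFail(string $json): array
{
    $response = json_decode($json, true);
    if (!is_array($response)) {
        throw new RuntimeException('Response was not a JSON object');
    }
    if (isset($response['error'])) {
        $description = $response['error_description'] ?? '';
        throw new RuntimeException($response['error'] . ': ' . $description);
    }
    return $response['data'] ?? [];
}

// The error body shown above, passed through the helper.
$message = null;
try {
    parsedDataOrFail('{"error":"not_found","error_description":"The URL provided was not found"}');
} catch (RuntimeException $e) {
    $message = $e->getMessage();
}
```
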
### Response Format

```json
{
  "data":{
    "type":"entry",
    "post-type":"photo",
    "published":"2017-03-01T19:00:33-08:00",
    "url":"https://aaronparecki.com/2017/03/01/14/hwc",
    "category":[
      "indieweb",
      "hwc"
    ],
    "photo":[
      "https://aaronparecki.com/2017/03/01/14/photo.jpg"
    ],
    "syndication":[
      "https://twitter.com/aaronpk/status/837135519427395584"
    ],
    "content":{
      "text":"Hello from Homebrew Website Club PDX! Thanks to @DreamHost for hosting us! 🍕🎉 #indieweb",
      "html":"Hello from Homebrew Website Club PDX! Thanks to <a href=\"https://twitter.com/DreamHost\">@DreamHost</a> for hosting us! <a href=\"https://aaronparecki.com/emoji/%F0%9F%8D%95\">🍕</a><a href=\"https://aaronparecki.com/emoji/%F0%9F%8E%89\">🎉</a> <a href=\"https://aaronparecki.com/tag/indieweb\">#indieweb</a>"
    },
    "author":{
      "type":"card",
      "name":"Aaron Parecki",
      "url":"https://aaronparecki.com/",
      "photo":"https://aaronparecki.com/images/profile.jpg"
    }
  },
  "url":"https://aaronparecki.com/2017/03/01/14/hwc",
  "code":200,
  "source-format":"mf2+html"
}
```

#### Primary Data

The primary object on the page is returned in the `data` property. This will indicate the type of object (e.g. `entry`), and will contain the vocabulary's properties that it was able to parse from the page.

* `type` - the Microformats 2 vocabulary found for the primary object on the page, without the `h-` prefix (e.g. `entry`, `event`)
* `post-type` - only for "posts" (e.g. not for `card`s) - the [Post Type](https://www.w3.org/TR/post-type-discovery/) of the post (e.g. `note`, `photo`, `reply`)

If a property supports multiple values, it will always be returned as an array. The following properties support multiple values:

* `in-reply-to`
* `like-of`
* `repost-of`
* `bookmark-of`
* `follow-of`
* `syndication`
* `photo` (of an entry, not of a card)
* `video`
* `audio`
* `category`

The content will be an object that always contains a `text` property and may contain an `html` property if the source document published HTML content. The `text` property must always be HTML-escaped before displaying it as HTML, as it may include unescaped characters such as `<` and `>`.

The author will always be set in the entry if available. The service follows the [authorship discovery](https://indieweb.org/authorship) algorithm to try to find the author information elsewhere on the page if it is not inside the entry in the source document.

All URLs provided in the output are absolute URLs. If the source document contains a relative URL, it will be resolved first.
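
The escaping rule for the `content` object described above can be sketched like this. `contentToHtml()` is a hypothetical helper; whether the `html` value is safe to output as-is is not stated in this section, so the sketch assumes the caller trusts or separately sanitizes it.

```php
<?php
// Hypothetical helper (not part of XRay): turn a `content` object into a
// string that can be placed into an HTML page.
function contentToHtml(array $content): string
{
    if (isset($content['html'])) {
        // Assumption: the caller trusts or separately sanitizes this markup.
        return $content['html'];
    }
    // `text` may contain raw "<", ">" or "&", so escape it before output.
    return htmlspecialchars($content['text'] ?? '', ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Plain-text content with characters that are special in HTML.
$escaped = contentToHtml(['text' => 'if a < b && b > c']);

// Content that also carries published HTML.
$rich = contentToHtml([
    'text' => 'Hello World',
    'html' => '<b>Hello</b> World',
]);
```
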
#### Post Type Discovery

XRay runs the [Post Type Discovery](https://www.w3.org/TR/post-type-discovery/) algorithm and includes the result in a `post-type` property.

The following post types are returned, which are slightly expanded from what is currently documented by the Post Type Discovery spec.

* `event`
* `recipe`
* `review`
* `rsvp`
* `repost`
* `like`
* `reply`
* `bookmark`
* `follow`
* `checkin`
* `video`
* `audio`
* `photo`
* `article`
* `note`

#### Other Properties

Other properties are returned in the response at the same level as the `data` property.

* `url` - The effective URL that the document was retrieved from. This will be the final URL after following any redirects.
* `code` - The HTTP response code returned by the URL. Typically this will be 200, but if the URL returned an alternate HTTP code that also included an h-entry (such as a 410 deleted notice with a stub h-entry), you can use this to find out that the original URL was actually deleted.
* `source-format` - Indicates the format of the source URL that was used to generate the parsed result. Possible values are:
  * `mf2+html`
  * `mf2+json`
  * `feed+json`
  * `xml`
  * `github`/`xkcd`

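
As a small illustration of the `code` property, the check for a deleted post might look like the sketch below. `isDeleted()` is a hypothetical helper, and the sample arrays are hand-written rather than real responses:

```php
<?php
// Hypothetical helper (not part of XRay): report whether a response came
// back with HTTP 410, i.e. the original URL has been deleted.
function isDeleted(array $response): bool
{
    return ($response['code'] ?? null) === 410;
}

// Hand-written samples: a stub entry served with 410, and a normal page.
$gone = ['data' => ['type' => 'entry'], 'url' => 'http://example.com/entry', 'code' => 410];
$live = ['data' => ['type' => 'entry'], 'url' => 'http://example.com/entry', 'code' => 200];

$goneIsDeleted = isDeleted($gone);
$liveIsDeleted = isDeleted($live);
```
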
#### Feeds

XRay can return information for several kinds of feeds. The URL (or body) passed to XRay will be checked for the following formats:

* XML (Atom and RSS)
* JSONFeed (https://jsonfeed.org)
* Microformats [h-feed](https://indieweb.org/h-feed)

If the page being parsed represents a feed, then the response will look like the following:

```json
{
  "data": {
    "type": "feed",
    "items": [
      {...},
      {...}
    ]
  }
}
```

Each object in the `items` array will contain a parsed version of the item, in the same format that XRay normally returns. When parsing Microformats feeds, the [authorship discovery](https://indieweb.org/authorship) will be run for each item to build out the author info.

Atom, RSS and JSONFeed will all be normalized to XRay's vocabulary, and only recognized properties will be returned.
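
Walking a feed response could look like the sketch below. It assumes each item carries a `url` property like the `data` object in the "Response Format" example; `feedItemUrls()` is a hypothetical helper, and the sample feed uses made-up `example.com` URLs.

```php
<?php
// Hypothetical helper (not part of XRay): collect the URL of every item in
// a feed response. Returns an empty array for non-feed responses.
function feedItemUrls(array $response): array
{
    $data = $response['data'] ?? [];
    if (($data['type'] ?? null) !== 'feed') {
        return [];
    }
    $urls = [];
    foreach ($data['items'] ?? [] as $item) {
        // Assumption: items expose `url` the same way a single entry does.
        if (isset($item['url'])) {
            $urls[] = $item['url'];
        }
    }
    return $urls;
}

// Hand-written sample feed with illustrative URLs.
$response = [
    'data' => [
        'type'  => 'feed',
        'items' => [
            ['type' => 'entry', 'url' => 'http://example.com/1'],
            ['type' => 'entry', 'url' => 'http://example.com/2'],
            ['type' => 'entry'], // no url, skipped
        ],
    ],
];

$urls = feedItemUrls($response);
```
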
## Rels API

There is also an API method to parse and return all rel values on the page, including HTTP `Link` headers and HTML rel values.

```
GET /rels?url=https://aaronparecki.com/
```

See [above](#rels) for the response format.

## Feed Discovery API

```
GET /feeds?url=https://aaronparecki.com/
```

See [above](#feed-discovery) for the response format.

## Token API

When verifying [Private Webmentions](https://indieweb.org/Private-Webmention#How_to_Receive_Private_Webmentions), you will need to exchange a code for an access token at the token endpoint specified by the source URL.

XRay provides an API that will do this in one step. You can provide the source URL and the code you got from the webmention, and XRay will discover the token endpoint and return an access token.

```
POST /token

source=http://example.com/private-post
&code=1234567812345678
```

The response will be the response from the token endpoint, which will include an `access_token` property, and possibly an `expires_in` property.

```
{
  "access_token": "eyJ0eXAXBlIjoI6Imh0dHB8idGFyZ2V0IjoraW0uZGV2bb-ZO6MV-DIqbUn_3LZs",
  "token_type": "bearer",
  "expires_in": 3600
}
```

If there was a problem fetching the access token, you will get one of the errors below, in addition to the HTTP-related errors returned by the parse API:

* `no_token_endpoint` - Unable to find an HTTP header specifying the token endpoint.

## Installation

### From Source

```
# Clone this repository
git clone git@github.com:aaronpk/XRay.git
cd XRay

# Install dependencies
composer install
```

### From Zip Archive

* Download the latest release from https://github.com/aaronpk/XRay/releases
* Extract to a folder on your web server

### Web Server Configuration

Configure your web server to point to the `public` folder.

Make sure all requests are routed to `index.php`. XRay ships with `.htaccess` files for Apache. For nginx, you'll need a rule like the following in your server config block.

```
try_files $uri /index.php?$args;
```