
Building Your First Web Scraper, Part 2

In this tutorial, you will learn how to use Mechanize to click links, fill out forms, and upload files. You will also learn how to slice Mechanize page objects and how to automate a Google search and save its results.

Topics

  • Single Page vs. Pagination
  • Mechanize
  • Agent
  • Page
  • Nokogiri Methods
  • Links
  • Click
  • Forms

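Single Page vs. Pagination

So far, we've spent our time figuring out how to screen scrape a single page with Nokogiri. That was a good foundation for taking things one step further and learning how to extract content from multiple pages.

After all, the problem we're trying to solve involves getting content from more than 140 episodes. That is more content than reasonably fits on a single web page. We need to work with pagination and figure out how to follow the content down the rabbit hole.

This is where Nokogiri stops and another useful gem called Mechanize comes into play.

Mechanize

Mechanize is another powerful tool with a lot to offer. It basically lets you automate interactions with the websites you need to extract content from. In that sense, it may remind you of some of the functionality you know from testing with Capybara.

Don't get me wrong, playing with Nokogiri on a single page is great in itself, but spicier data extraction jobs need a little more horsepower. With Mechanize, we can essentially crawl through as many pages as we need and interact with their elements, imitating and automating human behavior. Pretty powerful stuff!

This gem lets you follow links, fill out form fields, and submit that data. Even dealing with cookies is on the table. That means you can imitate a user logging in to a private session and pull content from a site that you have access to.

You fill in the login form with your credentials and tell Mechanize where to go from there. Because you can click links and submit forms, there is very little you can't do with this tool. It has a close relationship with Nokogiri and also depends on it. Aaron Patterson is one of the authors of this lovely gem.

Instantiating the Mechanize Agent

Before we can start mechanizing, we need to instantiate a Mechanize agent.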

some_scraper.rb

require 'mechanize'

agent = Mechanize.new

The agent is used to fetch a page, similar to what we did with Nokogiri.

some_scraper.rb

require 'mechanize'

agent = Mechanize.new

podcast_url = "https://betweenscreens.fm/"

page = agent.get(podcast_url)

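What happened here is that the Mechanize agent fetched the podcast page along with its cookies.

Extracting Page Content

We now have a page that is ready for extraction. Before we do that, I recommend taking a look under the hood with the inspect method.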

some_scraper.rb

require 'mechanize'

agent = Mechanize.new

podcast_url = "https://betweenscreens.fm/"

page = agent.get(podcast_url)

puts page.inspect

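The output is quite substantial. It consists of a Mechanize::Page object, and you can see all the attributes of that page in it.

For me, this is a really handy object for slicing up the data you want to extract.

Output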

#<Mechanize::Page
 {url #<URI::HTTPS https://betweenscreens.fm/>}
 {meta_refresh}
 {title "Between | Screens "}
 {iframes
  #<Mechanize::Page::Frame
   nil
   "https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/290328784&color=ff0000&auto...>
  #<Mechanize::Page::Frame
   nil
   "https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/290126141&color=ff0000&auto...>
  #<Mechanize::Page::Frame
   nil
   "https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/289018386&color=ff0000&auto...>
  #<Mechanize::Page::Frame
   nil
   "https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/287425105&color=ff0000&auto...>
  #<Mechanize::Page::Frame
   nil
   "https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/287105342&color=ff0000&auto...>
  #<Mechanize::Page::Frame
   nil
   "https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/221003494&color=ff0000&auto...>
  #<Mechanize::Page::Frame
   nil
   "https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/218101809&color=ff0000&auto...>}
 {frames}
 {links
  #<Mechanize::Page::Link "Logo cube" "/">
  #<Mechanize::Page::Link "fork!" "https://github.com/vis-kid/betweenscreens">
  #<Mechanize::Page::Link "about" "pages/about/">
  #<Mechanize::Page::Link "design" "design/">
  #<Mechanize::Page::Link "code" "code/">
  #<Mechanize::Page::Link "Randy J. Hunt" "episodes/144/">
  #<Mechanize::Page::Link "Jason Long" "episodes/143/">
  #<Mechanize::Page::Link "David Heinemeier Hansson" "episodes/142/">
  #<Mechanize::Page::Link "Zach Holman" "episodes/141/">
  #<Mechanize::Page::Link "Joel Glovier" "episodes/140/">
  #<Mechanize::Page::Link "João Ferreira" "episodes/139/">
  #<Mechanize::Page::Link "Corwin Harrell" "episodes/138/">
  #<Mechanize::Page::Link "Older Stuff »" "page/2/">
  #<Mechanize::Page::Link "Exercise" "/tags/exercise/">
  #<Mechanize::Page::Link "Company benefits" "/tags/company-benefits/">
  #<Mechanize::Page::Link "Tmux" "/tags/tmux/">
  #<Mechanize::Page::Link "FileTask" "/tags/filetask/">
  #<Mechanize::Page::Link "Decision making" "/tags/decision-making/">
  #<Mechanize::Page::Link "Favorite feature" "/tags/favorite-feature/">
  #<Mechanize::Page::Link "Working out" "/tags/working-out/">
  #<Mechanize::Page::Link "Scott Savarie" "/tags/scott-savarie/">
  #<Mechanize::Page::Link "Titles" "/tags/titles/">
  #<Mechanize::Page::Link "Erik Spiekermann" "/tags/erik-spiekermann/">
  #<Mechanize::Page::Link "Newbie mistakes" "/tags/newbie-mistakes/">
  #<Mechanize::Page::Link "Playbook" "/tags/playbook/">
  #<Mechanize::Page::Link "Delegation" "/tags/delegation/">
  #<Mechanize::Page::Link "Heat maps" "/tags/heat-maps/">
  #<Mechanize::Page::Link "Europe" "/tags/europe/">
  #<Mechanize::Page::Link "Sizing type" "/tags/sizing-type/">
  #<Mechanize::Page::Link "Focus" "/tags/focus/">
  #<Mechanize::Page::Link "Virtual assistants" "/tags/virtual-assistants/">
  #<Mechanize::Page::Link "Writing" "/tags/writing/">
  #<Mechanize::Page::Link "Hacking" "/tags/hacking/">
  #<Mechanize::Page::Link "Joel Glovier" "/tags/joel-glovier/">
  #<Mechanize::Page::Link "Corwin Harrell" "/tags/corwin-harrell/">
  #<Mechanize::Page::Link "Mario C. Delgado" "/tags/mario-c-delgado/">
  #<Mechanize::Page::Link "Tom Dale" "/tags/tom-dale/">
  #<Mechanize::Page::Link "Obie Fernandez" "/tags/obie-fernandez/">
  #<Mechanize::Page::Link "Chad Pytel" "/tags/chad-pytel/">
  #<Mechanize::Page::Link "Zach Holman" "/tags/zach-holman/">
  #<Mechanize::Page::Link "Max Luster" "/tags/max-luster/">
  #<Mechanize::Page::Link "Kyle Fiedler" "/tags/kyle-fiedler/">
  #<Mechanize::Page::Link "Roberto Machado" "/tags/roberto-machado/">}
 {forms}>

If you want to look at the HTML page itself, you can use the body or content methods.

some_scraper.rb

...

print page.body

...

Output

<!doctype html>

<html>
  <head>
    <meta charset="utf-8" />
    <meta http-equiv='X-UA-Compatible' content='IE=edge;chrome=1' />
    <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible">
    <meta name="viewport" content="initial-scale=1">
    <title>Between | Screens </title>
    <link rel="alternate" type="application/atom+xml" title="Atom Feed" href="/feed.xml" />
    <link href="stylesheets/all-11b45acc.css" rel="stylesheet" />
    <script src="javascripts/all-4c20da82.js"></script>
  </head>

  <body>
    <header>
      <div id="logo">
        <a href="/"><img src="images/Between_Screens_Logo_Cube_Up-539d6997.svg" alt="Logo cube" /></a>
      </div>
      <nav class="navigation">
        <ul class="nav-list"> 
          <li><a href="https://github.com/vis-kid/betweenscreens">fork!</a></li>
          <li><a href="pages/about/">about</a></li>
          <li><a href="design/">design</a></li>
          <li><a href="code/">code</a></li>
        </ul>
      </nav>
    </header>

    <div id="main" role="main">
      <div class='posts'>
        <ul>
          <li>
            <article class="index-article">
              <span class='post-date'>Oct 27 | 2016</span><h2 class='post-title'><a href="episodes/144/">Randy J. Hunt</a></h2>
              <h3 class='topic-list'>Organizing teams | Diversity | Desires | Pizza rule | Effective over clever | Novel solutions | Straightforwardness | Research | Coffeeshop test | Small changes | Reducing errors | Granular diffs</h3>
              <div class='soundcloud-player-small'>
                <iframe width="100%"
                  height="166"
                  scrolling="no"
                  frameborder="no"
                  src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/290328784&color=ff0000&...>
              </div>
            </article>
          </li>

          <li>
            <article class="index-article">
              <span class='post-date'>Oct 25 | 2016</span><h2 class='post-title'><a href="episodes/143/">Jason Long</a></h2>
              <h3 class='topic-list'>Open source | Empathy | Lower barriers | Learning tool | Design contributions | Git website | Branding | GitHub | Neovim | Tmux | Design love | Knowing audiences | Showing work | Dribbble | Progressions | Ideas</h3>
              <div class='soundcloud-player-small'>
                <iframe width="100%"
                height="166"
                scrolling="no"
                frameborder="no"
                src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/290126141&color=ff0000&...>
              </div>
            </article>
          </li>

          <li>
            <article class="index-article">
              <span class='post-date'>Oct 18 | 2016</span><h2 class='post-title'><a href="episodes/142/">David Heinemeier Hansson</a></h2>
              <h3 class='topic-list'>Rails community | Tone | Technical disagreements | Community policing | Ungratefulness | No assholes allowed | Basecamp | Open source persona | Aspirations | Guarding motivations | Dealing with audiences | Pressure | Honesty | Diverse opinions | Small talk</h3>
              <div class='soundcloud-player-small'>
                <iframe width="100%"
                height="166"
                scrolling="no"
                frameborder="no"
                src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/289018386&color=ff0000&...>
              </div>
            </article>
          </li>

          <li>
            <article class="index-article">
              <span class='post-date'>Oct 12 | 2016</span><h2 class='post-title'><a href="episodes/141/">Zach Holman</a></h2>
              <h3 class='topic-list'>Getting Fired | Taboo | Transparency | Different Perspectives | Timing | Growth Stages | Employment & Dating | Managers | At-will Employment | Tech Industry | Europe | Low hanging Fruits | Performance Improvement Plans | Meeting Goals | Surprise Firings | Firing Fast | Mistakes | Company Culture | Communication</h3>
              <div class='soundcloud-player-small'>  
                <iframe width="100%"
                  height="166"
                  scrolling="no"
                  frameborder="no"
                  src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/287425105&color=ff0000&...>
              </div>
            </article>
          </li>

          <li>
            <article class="index-article">
              <span class='post-date'>Oct 10 | 2016</span><h2 class='post-title'><a href="episodes/140/">Joel Glovier</a></h2>
              <h3 class='topic-list'>Digital Product Design | Product Design @ GitHub | Loving Design | Order & Chaos | Drawing | Web Design | HospitalRun | Diversity | Startup Culture | Improving Lives | CURE International | Ember | Offline First | Hospital Information System | Designers & Open Source</h3>
              <div class='soundcloud-player-small'>
                <iframe width="100%"
                  height="166"
                  scrolling="no"
                  frameborder="no"
                  src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/287105342&color=ff0000&...>
              </div>
            </article>
          </li>

          <li>
            <article class="index-article">
              <span class='post-date'>Aug 26 | 2015</span><h2 class='post-title'><a href="episodes/139/">João Ferreira</a></h2>
              <h3 class='topic-list'>Masters @ Work | Subvisual | Deadlines | Design personality | Design problems | Team | Pushing envelopes | Delightful experiences | Perfecting details | Company values</h3>
              <div class='soundcloud-player-small'>
                <iframe width="100%"
                height="166"
                scrolling="no"
                frameborder="no"
                src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/221003494&color=ff0000&...>
              </div>
            </article>
          </li>

          <li>
            <article class="index-article">
              <span class='post-date'>Aug 06 | 2015</span><h2 class='post-title'><a href="episodes/138/">Corwin Harrell</a></h2>
              <h3 class='topic-list'>Q&A | 01 | University | Graphic design | Design setup | Sublime | Atom | thoughtbot | Working location | Collaboration & pairing | Vim advocates | Daily routine | Standups | Clients | Coffee walks | Investment Fridays |</h3>
              <div class='soundcloud-player-small'>
                <iframe width="100%"
                height="166"
                scrolling="no"
                frameborder="no"
                src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/218101809&color=ff0000&...>
              </div>
            </article>
          </li>
        </ul>
      </div>

      <section>
        <div class='pagination-link'><a href="page/2/">Older Stuff »</a></div>
      </section>
    </div>

    <footer>
      <div class='footer-tags'>
        <h3>Random Tags</h3>
        <ul class='random-tag-list'>
          <li><a href="/tags/exercise/">Exercise</a></li>
          <li><a href="/tags/company-benefits/">Company benefits</a></li>
          <li><a href="/tags/tmux/">Tmux</a></li>
          <li><a href="/tags/filetask/">FileTask</a></li>
          <li><a href="/tags/decision-making/">Decision making</a></li>
          <li><a href="/tags/favorite-feature/">Favorite feature</a></li>
          <li><a href="/tags/working-out/">Working out</a></li>
          <li><a href="/tags/scott-savarie/">Scott Savarie</a></li>
          <li><a href="/tags/titles/">Titles</a></li>
          <li><a href="/tags/erik-spiekermann/">Erik Spiekermann</a></li>
          <li><a href="/tags/newbie-mistakes/">Newbie mistakes</a></li>
          <li><a href="/tags/playbook/">Playbook</a></li>
          <li><a href="/tags/delegation/">Delegation</a></li>
          <li><a href="/tags/heat-maps/">Heat maps</a></li>
          <li><a href="/tags/europe/">Europe</a></li>
          <li><a href="/tags/sizing-type/">Sizing type</a></li>
          <li><a href="/tags/focus/">Focus</a></li>
          <li><a href="/tags/virtual-assistants/">Virtual assistants</a></li>
          <li><a href="/tags/writing/">Writing</a></li>
          <li><a href="/tags/hacking/">Hacking</a></li>
        </ul>
      </div>

      <div class='recent-posts'>
        <h3>Random Interviewees</h3>
        <ul>
          <li><a href="/tags/joel-glovier/">Joel Glovier</a></li>
          <li><a href="/tags/corwin-harrell/">Corwin Harrell</a></li>
          <li><a href="/tags/mario-c-delgado/">Mario C. Delgado</a></li>
          <li><a href="/tags/tom-dale/">Tom Dale</a></li>
          <li><a href="/tags/obie-fernandez/">Obie Fernandez</a></li>
          <li><a href="/tags/chad-pytel/">Chad Pytel</a></li>
          <li><a href="/tags/zach-holman/">Zach Holman</a></li>
          <li><a href="/tags/max-luster/">Max Luster</a></li>
          <li><a href="/tags/kyle-fiedler/">Kyle Fiedler</a></li>
          <li><a href="/tags/roberto-machado/">Roberto Machado</a></li>
        </ul>
      </div>
    </footer>
  </body>
</html>

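Since this podcast page has only a small number of different elements, here is the Mechanize::Page that is returned from github.com. It shows a greater variety of content. I think it's important to get a feel for this.

github.com Output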

#<Mechanize::Page
 {url #<URI::HTTPS https://github.com/>}
 {meta_refresh}
 {title "How people build software · GitHub"}
 {iframes}
 {frames}
 {links
  #<Mechanize::Page::Link "Skip to content" "#start-of-content">
  #https://github.com/">
  #<Mechanize::Page::Link "\n          Personal\n" "/personal">
  #<Mechanize::Page::Link "\n          Open source\n" "/open-source">
  #<Mechanize::Page::Link "\n          Business\n" "/business">
  #<Mechanize::Page::Link "\n          Explore\n" "/explore">
  #<Mechanize::Page::Link "Sign up" "/join?source=header-home">
  #<Mechanize::Page::Link "Sign in" "/login">
  #<Mechanize::Page::Link "Pricing" "/pricing">
  #<Mechanize::Page::Link "Blog" "/blog">
  #https://help.github.com">
  #https://github.com/search">
  #https://help.github.com/terms">
  #https://help.github.com/privacy">
  #<Mechanize::Page::Link "Sign up for GitHub" "/join?source=button-home">
  #<Mechanize::Page::Link
   "\n      \n        \n      \n      \n        A whole new Universe\n        \n          Learn about the exciting features and announcements revealed at this year's GitHub Universe conference.\n        \n      \n    "
   "/universe-2016">
  #<Mechanize::Page::Link "Individuals " "/personal">
  #<Mechanize::Page::Link "Communities " "/open-source">
  #<Mechanize::Page::Link "Businesses " "/business">
  #<Mechanize::Page::Link "NASA" "//github.com/nasa">
  #<Mechanize::Page::Link "Sign up for GitHub" "/join?source=button-home">
  #https://github.com/contact">
  #https://developer.github.com">
  #https://training.github.com">
  #https://shop.github.com">
  #https://github.com/blog">
  #https://github.com/about">
  #https://github.com">
  #https://github.com/site/terms">
  #https://github.com/site/privacy">
  #https://github.com/security">
  #https://status.github.com/">
  #https://help.github.com">
  #<Mechanize::Page::Link "Reload" "">
  #<Mechanize::Page::Link "Reload" "">}
 {forms
  #<Mechanize::Form
   {name nil}
   {method "GET"}
   {action "/search"}
   {fields
    [hidden:0x3feb90f8297c type: hidden name: utf8 value: ✓]
    [text:0x3feb90f827d8 type: text name: q value: ]}
   {radiobuttons}
   {checkboxes}
   {file_uploads}
   {buttons}>
  #<Mechanize::Form
   {name nil}
   {method "POST"}
   {action "/join"}
   {fields
    [hidden:0x3feb90f7be38 type: hidden name: utf8 value: ✓]
    [hidden:0x3feb90f7bbb8 type: hidden name: authenticity_token value: vjRATKj7smXreq6Lt02r+MzW+ewWoi+fRzQXPedFAlOZgwzxQ0dZnChirhDfd7vyWZZZBO+ZFydLNedjIEDsrQ==]
    [text:0x3feb90f7b9d8 type: text name: user[login] value: ]
    [text:0x3feb90f7b7f8 type: text name: user[email] value: ]
    [field:0x3feb90f7b654 type: password name: user[password] value: ]
    [hidden:0x3feb90f7b474 type: hidden name: source value: form-home]}
   {radiobuttons}
   {checkboxes}
   {file_uploads}
   {buttons [button:0x3feb90f7a038 type: submit name:  value: ]}>}>

Back to the podcast: you can also look at things like the encodings, the HTTP response code, the URI, or the response headers.

some_scraper.rb

require 'mechanize'

agent = Mechanize.new

podcast_url = "https://betweenscreens.fm/"

page = agent.get(podcast_url)

puts 'Encodings'
puts page.encodings
puts 'Response Headers'
puts page.response
puts 'HTTP response code'
puts page.code
puts 'URI'
puts page.uri

Output

Encodings
EUC-JP
utf-8
utf-8

Response Headers
{"server"=>"GitHub.com", "date"=>"Sat, 29 Oct 2016 17:56:00 GMT", "content-type"=>"text/html; charset=utf-8", "transfer-encoding"=>"chunked", "last-modified"=>"Fri, 28 Oct 2016 01:48:56 GMT", "access-control-allow-origin"=>"*", "expires"=>"Sat, 29 Oct 2016 18:06:00 GMT", "cache-control"=>"max-age=600", "content-encoding"=>"gzip", "x-github-request-id"=>"501C936D:C723:1631523C:5814E2B0"}

HTTP response code
200

URI
https://betweenscreens.fm/

There is plenty more if you want to dig deeper. I'll leave it at that.

Nokogiri Methods

  • at
  • search

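Mechanize uses Nokogiri to scrape data from pages. You can apply what you learned about Nokogiri in the first article and use it on Mechanize pages as well. That means you generally use Mechanize to navigate pages and Nokogiri methods for your scraping needs.

For example, if you want to find a single object, you can use at, while search returns all objects that match a selector on a particular page. In other words, these methods work on both Nokogiri document objects and Mechanize page objects.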

some_scraper.rb

require 'mechanize'

agent = Mechanize.new

podcast_url = "https://betweenscreens.fm/"

page = agent.get(podcast_url)

first_title = page.at('h2.post-title')

all_titles = page.search('h2.post-title')

all_titles.each do |title|
  puts title
end

puts " * "*33

puts first_title

Output

<h2 class="post-title"><a href="episodes/144/">Randy J. Hunt</a></h2>
<h2 class="post-title"><a href="episodes/143/">Jason Long</a></h2>
<h2 class="post-title"><a href="episodes/142/">David Heinemeier Hansson</a></h2>
<h2 class="post-title"><a href="episodes/141/">Zach Holman</a></h2>
<h2 class="post-title"><a href="episodes/140/">Joel Glovier</a></h2>
<h2 class="post-title"><a href="episodes/139/">João Ferreira</a></h2>
<h2 class="post-title"><a href="episodes/138/">Corwin Harrell</a></h2>
 *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  * 
<h2 class="post-title"><a href="episodes/144/">Randy J. Hunt</a></h2>

Links

  • links
  • link_with
  • links_with

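You can also navigate the whole site to your liking. Probably the most important part of Mechanize is its ability to let you play with links. Otherwise, you could pretty much stick with Nokogiri on its own. Let's take a look at what gets returned when we ask a page for its links.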

some_scraper.rb

require 'mechanize'

agent = Mechanize.new

podcast_url = "https://betweenscreens.fm/"

page = agent.get(podcast_url)

puts "#{page.links}"

Output

[#<Mechanize::Page::Link "Logo cube" "/">
, #<Mechanize::Page::Link "fork!" "https://github.com/vis-kid/betweenscreens">
, #<Mechanize::Page::Link "about" "pages/about/">
, #<Mechanize::Page::Link "design" "design/">
, #<Mechanize::Page::Link "code" "code/">
, #<Mechanize::Page::Link "Randy J. Hunt" "episodes/144/">
, #<Mechanize::Page::Link "Jason Long" "episodes/143/">
, #<Mechanize::Page::Link "David Heinemeier Hansson" "episodes/142/">
, #<Mechanize::Page::Link "Zach Holman" "episodes/141/">
, #<Mechanize::Page::Link "Joel Glovier" "episodes/140/">
, #<Mechanize::Page::Link "João Ferreira" "episodes/139/">
, #<Mechanize::Page::Link "Corwin Harrell" "episodes/138/">
, #<Mechanize::Page::Link "Older Stuff »" "page/2/">
, #<Mechanize::Page::Link "Exercise" "/tags/exercise/">
, #<Mechanize::Page::Link "Company benefits" "/tags/company-benefits/">
, #<Mechanize::Page::Link "Tmux" "/tags/tmux/">
, #<Mechanize::Page::Link "FileTask" "/tags/filetask/">
, #<Mechanize::Page::Link "Decision making" "/tags/decision-making/">
, #<Mechanize::Page::Link "Favorite feature" "/tags/favorite-feature/">
, #<Mechanize::Page::Link "Working out" "/tags/working-out/">
, #<Mechanize::Page::Link "Scott Savarie" "/tags/scott-savarie/">
, #<Mechanize::Page::Link "Titles" "/tags/titles/">
, #<Mechanize::Page::Link "Erik Spiekermann" "/tags/erik-spiekermann/">
, #<Mechanize::Page::Link "Newbie mistakes" "/tags/newbie-mistakes/">
, #<Mechanize::Page::Link "Playbook" "/tags/playbook/">
, #<Mechanize::Page::Link "Delegation" "/tags/delegation/">
, #<Mechanize::Page::Link "Heat maps" "/tags/heat-maps/">
, #<Mechanize::Page::Link "Europe" "/tags/europe/">
, #<Mechanize::Page::Link "Sizing type" "/tags/sizing-type/">
, #<Mechanize::Page::Link "Focus" "/tags/focus/">
, #<Mechanize::Page::Link "Virtual assistants" "/tags/virtual-assistants/">
, #<Mechanize::Page::Link "Writing" "/tags/writing/">
, #<Mechanize::Page::Link "Hacking" "/tags/hacking/">
, #<Mechanize::Page::Link "Joel Glovier" "/tags/joel-glovier/">
, #<Mechanize::Page::Link "Corwin Harrell" "/tags/corwin-harrell/">
, #<Mechanize::Page::Link "Mario C. Delgado" "/tags/mario-c-delgado/">
, #<Mechanize::Page::Link "Tom Dale" "/tags/tom-dale/">
, #<Mechanize::Page::Link "Obie Fernandez" "/tags/obie-fernandez/">
, #<Mechanize::Page::Link "Chad Pytel" "/tags/chad-pytel/">
, #<Mechanize::Page::Link "Zach Holman" "/tags/zach-holman/">
, #<Mechanize::Page::Link "Max Luster" "/tags/max-luster/">
, #<Mechanize::Page::Link "Kyle Fiedler" "/tags/kyle-fiedler/">
, #<Mechanize::Page::Link "Roberto Machado" "/tags/roberto-machado/">
]

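Holy moly, let's break this down. Since we haven't told Mechanize to look anywhere else, we only got an array of links from the front page. Mechanize goes through that page from top to bottom and returns the links in that order. I created a little image with green pointers to the various links you can see in the output.

By the way, this already shows the final result of the redesign for my podcast. I think this version is a little better for demonstration purposes. You also get a look at how the final result turned out and why I needed to scrape my old Sinatra site.

[Screenshot: the podcast's front page, with green pointers marking the links listed in the output]

As always, you can also extract just the text from this.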

some_scraper.rb

require 'mechanize'

agent = Mechanize.new

podcast_url = "https://betweenscreens.fm/"

page = agent.get(podcast_url)

page.links.each do |link|
  puts link.text
end

Output

Logo cube
fork!
about
design
code
Randy J. Hunt
Jason Long
David Heinemeier Hansson
Zach Holman
Joel Glovier
João Ferreira
Corwin Harrell
Older Stuff »
Exercise
Company benefits
Tmux
FileTask
Decision making
Favorite feature
Working out
Scott Savarie
Titles
Erik Spiekermann
Newbie mistakes
Playbook
Delegation
Heat maps
Europe
Sizing type
Focus
Virtual assistants
Writing
Hacking
Joel Glovier
Corwin Harrell
Mario C. Delgado
Tom Dale
Obie Fernandez
Chad Pytel
Zach Holman
Max Luster
Kyle Fiedler
Roberto Machado

Getting all of these links in bulk can be very useful, or simply tedious. Luckily, we have a few tools at hand to fine-tune what we need.

some_scraper.rb

require 'mechanize'

agent = Mechanize.new

podcast_url = "https://betweenscreens.fm/"

page = agent.get(podcast_url)

focus_link = agent.page.links.find { |link| link.text == 'Focus' }

puts focus_link

Output

Focus

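Boom! Now we're getting somewhere! We can zoom in on specific links like this. For example, we can use a fancier API such as links_with or link_with to target links that match certain criteria, like their text. And if there are multiple Focus links, we can pick a specific one by using brackets [] with an index on the array that links_with returns.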

some_scraper.rb

require 'mechanize'

agent = Mechanize.new

podcast_url = "https://betweenscreens.fm/"

page = agent.get(podcast_url)

focus_link = agent.page.links_with(:text => 'Focus')[2]

puts focus_link

If you're not after the link text but the link's target, you only need to specify a particular href to find that link. Mechanize won't stand in your way. Instead of text, you pass href to these methods.

some_scraper.rb

link = agent.page.link_with(href: '/episodes/95/')

links = agent.page.links_with(href: '/episodes/95/')

If you only want to find the first link with the text you're after, you can also use this syntax. It's very convenient and a bit more readable.

some_scraper.rb

focus_link = agent.page.link_with(:text => 'Focus')

How about we follow that link and see what's hidden behind Focus? Let's click it!

Click

some_scraper.rb

require 'mechanize'

agent = Mechanize.new

podcast_url = "https://betweenscreens.fm/"

page = agent.get(podcast_url)

focus_links = agent.page.links.find { |link| link.text == 'Focus' }.click.links

puts focus_links

This gives us another long list of links, like before. See how easy it was to chain .click.links? Mechanize clicks the link for you and follows it to the new destination. Because we also asked for a list of links, Mechanize returns all the links it can find on that new page.

Let's say there are two text links for the same interviewee, one leading to a tag and the other to a recent episode, and you want to get the links from each of these pages.

some_scraper.rb

require 'mechanize'

agent = Mechanize.new

podcast_url = "https://betweenscreens.fm/"

page = agent.get(podcast_url)

links = agent.page.links_with(text: "Some interviewee")

links.each do |link|
  puts link.click.links
end

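This gives you a list of links for both pages. As you iterate over each of the interviewee's links, Mechanize follows the clicked link and collects the links it finds on the new page. Below are a few more examples that you can compare and combine to get started.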

some_scraper.rb

agent.page.links.find { |l| l.text == 'Focus' }
agent.page.links.find { |l| l.text == 'Focus' }.click
agent.page.link_with(text: 'Focus')
agent.page.links_with(text: 'Focus')[0]
agent.page.links_with(text: 'Focus')[1].click
agent.page.links_with(text: 'Focus')[2].click.links
agent.page.link_with(href: '/some-href')
agent.page.link_with(href: '/some-href').click
agent.page.links_with(href: '/some-href')
# links_with returns an Array, so each link is clicked in turn
agent.page.links_with(href: '/some-href').map(&:click)

Forms

  • submit
  • field_with
  • checkbox_with
  • radiobuttons_with
  • file_uploads

Let's talk about forms!

some_scraper.rb

require 'mechanize'

agent = Mechanize.new

google_url = "https://google.com/"

page = agent.get(google_url)

forms = page.forms

puts forms.inspect

Output

[#<Mechanize::Form
# Attention!!
 {name "f"}
# Attention!!
 {method "GET"}
 {action "/search"}
 {fields
  [hidden:0x3fea91d2eb08 type: hidden name: ie value: ISO-8859-1]
  [hidden:0x3fea91d2e964 type: hidden name: hl value: es]
  [hidden:0x3fea91d2e7e8 type: hidden name: source value: hp]
  [hidden:0x3fea91d2e5f4 type: hidden name: biw value: ]
  [hidden:0x3fea91d2e428 type: hidden name: bih value: ]
# Attention!!
  [text:0x3fea91d2e248 type:  name: q value: ]
# Attention!!
  [hidden:0x3fea91d2bcb4 type: hidden name: gbv value: 1]}
 {radiobuttons}
 {checkboxes}
 {file_uploads}
 {buttons
  [submit:0x3fea91d2e0f4 type: submit name: btnG value: Buscar con Google]
  [submit:0x3fea91d2be80 type: submit name: btnI value: Voy a tener suerte]}>
]

Because we used the forms method, we got an array back, even though it contains only one form. Now that we know the form has the name "f", we can use the singular form method to home in on that one.

...

{name "f"}

...

some_scraper.rb

require 'mechanize'

agent = Mechanize.new

google_url = "https://google.com/"

page = agent.get(google_url)

search_form = page.form('f')

puts search_form.inspect

Using form('f'), we singled out the particular form we want to work with. As a result, we no longer get an array back.

Output

#<Mechanize::Form
# Attention!!
 {name "f"}
# Attention!!
 {method "GET"}
 {action "/search"}
 {fields
  [hidden:0x3ffe9ce85ba4 type: hidden name: ie value: ISO-8859-1]
  [hidden:0x3ffe9ce859d8 type: hidden name: hl value: es]
  [hidden:0x3ffe9ce857bc type: hidden name: source value: hp]
  [hidden:0x3ffe9ce85618 type: hidden name: biw value: ]
  [hidden:0x3ffe9ce853e8 type: hidden name: bih value: ]
# Attention!!
  [text:0x3ffe9ce851cc type:  name: q value: ]
# Attention!!
  [hidden:0x3ffe9ce84bdc type: hidden name: gbv value: 1]}
 {radiobuttons}
 {checkboxes}
 {file_uploads}
 {buttons
  [submit:0x3ffe9ce85078 type: submit name: btnG value: Buscar con Google]
  [submit:0x3ffe9ce84e48 type: submit name: btnI value: Voy a tener suerte]}>

We can also identify the name of the text input field (q).

...

[text:0x3ffe9ce851cc type:  name: q value: ]

...

We can target the field by that name and set its value the same way we would set an attribute on a Ruby object. All we need to do is assign it a new value. You can see from the output example above that it is empty by default.

some_scraper.rb

require 'mechanize'

agent = Mechanize.new

google_url = "https://google.com/"

page = agent.get(google_url)

search_form = page.form('f')
search_form.q = 'New Google Search'

puts search_form.inspect

Output

#<Mechanize::Form
 {name "f"}
 {method "GET"}
 {action "/search"}
 {fields
  [hidden:0x3fcb85b6a784 type: hidden name: ie value: ISO-8859-1]
  [hidden:0x3fcb85b6a57c type: hidden name: hl value: es]
  [hidden:0x3fcb85b6a3b0 type: hidden name: source value: hp]
  [hidden:0x3fcb85b6a16c type: hidden name: biw value: ]
  [hidden:0x3fcb85b67f20 type: hidden name: bih value: ]
# Attention!!
  [text:0x3fcb85b67d18 type:  name: q value: New Google Search]
# Attention!!
  [hidden:0x3fcb85b67728 type: hidden name: gbv value: 1]}
 {radiobuttons}
 {checkboxes}
 {file_uploads}
 {buttons
  [submit:0x3fcb85b67b9c type: submit name: btnG value: Buscar con Google]
  [submit:0x3fcb85b67994 type: submit name: btnI value: Voy a tener suerte]}>

As you can see above, the value of the text field has changed to New Google Search. Now we only need to submit the form and collect the results from the page that Google returns. It couldn't be any easier. Let's search for something else this time!

some_scraper.rb

require 'mechanize'

agent = Mechanize.new

google_url = "https://google.com/"
page = agent.get(google_url)

search_form = page.form('f')
search_form.q = 'GitHub TouchFart'

page = agent.submit(search_form)

pp page.search('h3.r').map(&:text)

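Here I identified the search result headers using the CSS selector h3.r, mapped them to their text, and pretty printed the results. That wasn't hard, was it? The selector matches the markup Google served when this example was run, so it may need adjusting if that markup changes. It is an easy example, sure, but think about the endless possibilities you have at your disposal with this!

Output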

["GitHub - hungtruong/TouchFart: A fart app for the new Macbook ...",
 "TouchFart/TouchFart at master · hungtruong/TouchFart · GitHub",
 "Commits · hungtruong/TouchFart · GitHub",
 "Projects · hungtruong/TouchFart · GitHub",
 "Pull Requests · hungtruong/TouchFart · GitHub",
 "Issues · hungtruong/TouchFart · GitHub",
 "TouchFart/license.txt at master · hungtruong/TouchFart · GitHub",
 "Add autoplay attribute to <audio> tag and touchfart (er ... - GitHub",
 "Find file - File Finder · GitHub",
 "Fart app for the new Macbook Pro's Touch... #3860 on topic touchfart ..."]

Mechanize has different input fields available for you to play with. You can even upload files!

  • field_with
  • checkbox_with
  • radiobuttons_with
  • file_uploads

You can also identify radio buttons and checkboxes by their name and check them with, you guessed it, check.

some_scraper.rb

form.radiobuttons_with(:name => 'gender')[3].check

form.checkbox_with(:name => 'coder').check

Option tags let users select one item from a drop-down list. Again, we target the field by name and select the option we want by its index.

some_scraper.rb

form.field_with(:name => 'countries').options[22].select

File uploads work much like entering text into a form: you set the value as you would a Ruby attribute. You identify the upload field and then specify the file path (file name) you want to transfer. It sounds more complicated than it is. Let's have a look!

some_scraper.rb

form.file_uploads.first.file_name = "some-path/some-image.jpg"

Final Thoughts

See, no magic after all! You are now well equipped to have some fun on your own. There is certainly a bit more to learn about Nokogiri and Mechanize, but instead of spending too much time on unnecessary aspects, play around with it and look into some more documentation when you run into problems beyond the scope of a beginner article.

I hope you can see how beautifully simple this gem is and how much power it offers. As we all know from popular culture by now, this also bears responsibilities. Use it within legal frameworks and when you have no access to an API. You probably won’t have a frequent use for these tools, but boy do they come in handy when you have some real scraping needs ahead of you.

As promised, in the next article we will cover a real-world example in which I scrape data from my podcast site. I will extract it from an old Sinatra site and transfer it to my new Middleman site, which uses .markdown files for each episode. We will extract the dates, episode numbers, interviewee names, headers, subheaders, and so on. See you there!