StatisticsTracker

• Performs the statistics and reporting functions.
For each snapshot, the corresponding information is written to the 'progress-statistics.log' file.

Default configuration
<bean id="statisticsTracker" class="org.archive.crawler.reporting.StatisticsTracker" autowire="byName">
  <!-- <property name="reports">
         <list>
           <bean id="crawlSummaryReport" class="org.archive.crawler.reporting.CrawlSummaryReport" />
           <bean id="seedsReport" class="org.archive.crawler.reporting.SeedsReport" />
           <bean id="hostsReport" class="org.archive.crawler.reporting.HostsReport" />
           <bean id="sourceTagsReport" class="org.archive.crawler.reporting.SourceTagsReport" />
           <bean id="mimetypesReport" class="org.archive.crawler.reporting.MimetypesReport" />
           <bean id="responseCodeReport" class="org.archive.crawler.reporting.ResponseCodeReport" />
           <bean id="processorsReport" class="org.archive.crawler.reporting.ProcessorsReport" />
           <bean id="frontierSummaryReport" class="org.archive.crawler.reporting.FrontierSummaryReport" />
           <bean id="frontierNonemptyReport" class="org.archive.crawler.reporting.FrontierNonemptyReport" />
           <bean id="toeThreadsReport" class="org.archive.crawler.reporting.ToeThreadsReport" />
         </list>
       </property> -->
  <!-- <property name="reportsDir" value="${launchId}/reports" /> -->
  <!-- <property name="liveHostReportSize" value="20" /> -->
  <!-- <property name="intervalSeconds" value="20" /> -->
  <!-- <property name="keepSnapshotsCount" value="5" /> -->
</bean>
• [Property descriptions]
1. reports: Lists the reports to generate. Each report is managed by its own bean.
1. CrawlSummaryReport: Summary of the overall crawl size.
(ex)
crawl name: basic
crawl status: Finished
duration: 15s8ms

seeds crawled: 2
seeds uncrawled: 0

hosts visited: 11

URIs processed: 81
URI successes: 81
URI failures: 0
URI disregards: 0

novel URIs: 81

total crawled bytes: 813450 (794 KiB)
novel crawled bytes: 813450 (794 KiB)

URIs/sec: 5.4
KB/sec: 52

2. SeedsReport: Crawl result for each seed.
[code] [status] [seed] [redirect]

3. HostsReport: Per-host statistics.
[#urls] [#bytes] [host] [#robots] [#remaining] [#novel-urls] [#novel-bytes] [#dup-by-hash-urls] [#dup-by-hash-bytes] [#not-modified-urls] [#not-modified-bytes]

4. SourceTagsReport: Records URL counts per host for each source tag (in most cases the source tag is a seed).
[source] [host] [#urls]


5. MimetypesReport: Per-MIME-type statistics.
[#urls] [#bytes] [mime-types]


6. ResponseCodeReport: Number of URLs found per response/disposition code.
[#urls] [rescode]

7. ProcessorsReport: For each processor in the CrawlController, records its class and the results specific to that processor's purpose.
(ex)
CandidateChain - Processors report - 201608050647
  Number of Processors: 2
Processor: org.archive.crawler.prefetch.CandidateScoper
Processor: org.archive.crawler.prefetch.FrontierPreparer
FetchChain - Processors report - 201608050647
  Number of Processors: 9
Processor: org.archive.crawler.prefetch.Preselector
Processor: org.archive.crawler.prefetch.PreconditionEnforcer
Processor: org.archive.modules.fetcher.FetchDNS
Processor: org.archive.modules.fetcher.FetchHTTP
  Function:          Fetch HTTP URIs
  CrawlURIs handled: 71
  Recovery retries:   0
Processor: org.archive.modules.extractor.ExtractorHTTP
  4 links from 71 CrawlURIs
Processor: org.archive.modules.extractor.ExtractorHTML
  1063 links from 26 CrawlURIs
Processor: org.archive.modules.extractor.ExtractorCSS
  15 links from 9 CrawlURIs
Processor: org.archive.modules.extractor.ExtractorJS
  7 links from 8 CrawlURIs
Processor: org.archive.modules.extractor.ExtractorSWF
  0 links from 0 CrawlURIs
DispositionChain - Processors report - 201608050647
  Number of Processors: 3
Processor: org.archive.modules.writer.WARCWriterProcessor
  Function:          Writes WARCs
  Total CrawlURIs:   81
  Revisit records:   0
  Crawled content bytes (including http headers): 813450 (794 KiB)
  Total uncompressed bytes (including all warc records): 994111 (971 KiB)
  Total size on disk (compressed): 363358 (355 KiB)
Processor: org.archive.crawler.postprocessor.CandidatesProcessor
Processor: org.archive.crawler.postprocessor.DispositionProcessor

8. FrontierSummaryReport: A file typically checked while a crawl is in progress; it summarizes the state of each kind of queue.
(ex)
Frontier report - 201608050647
 Job being crawled: basic

 -----===== STATS =====-----
 Discovered:    81
 Queued:        0
 Finished:      81
  Successfully: 81
  Failed:       0
  Disregarded:  0

 -----===== QUEUES =====-----
 Already included size:     81
               pending:     0

 All class queues map size: 11
             Active queues: 0
                    In-process: 0
                         Ready: 0
                       Snoozed: 0
           Inactive queues: 0 (p3: 0)
            Retired queues: 0
          Exhausted queues: 11

             Last state: EMPTY
 -----===== MANAGER THREAD =====-----
Java Thread State: RUNNABLE
Blocked/Waiting On: NONE
    java.lang.Thread.getStackTrace(Thread.java:1479)
    org.archive.crawler.framework.ToeThread.reportThread(ToeThread.java:484)
    org.archive.crawler.frontier.WorkQueueFrontier.reportTo(WorkQueueFrontier.java:1304)
    org.archive.crawler.reporting.FrontierSummaryReport.write(FrontierSummaryReport.java:39)
    org.archive.crawler.reporting.StatisticsTracker.writeReportFile(StatisticsTracker.java:910)
    org.archive.crawler.reporting.StatisticsTracker.dumpReports(StatisticsTracker.java:938)
    org.archive.crawler.reporting.StatisticsTracker.stop(StatisticsTracker.java:342)
    org.springframework.context.support.DefaultLifecycleProcessor.doStop(DefaultLifecycleProcessor.java:236)
    org.springframework.context.support.DefaultLifecycleProcessor.doStop(DefaultLifecycleProcessor.java:213)
    org.springframework.context.support.DefaultLifecycleProcessor.doStop(DefaultLifecycleProcessor.java:213)
    org.springframework.context.support.DefaultLifecycleProcessor.doStop(DefaultLifecycleProcessor.java:213)
    org.springframework.context.support.DefaultLifecycleProcessor.access$2(DefaultLifecycleProcessor.java:206)
    org.springframework.context.support.DefaultLifecycleProcessor$LifecycleGroup.stop(DefaultLifecycleProcessor.java:352)
.
.
.
 -----===== 11 LONGEST QUEUES =====-----
LONGEST#0:
:
:

LONGEST#1:
:
:

 -----===== IN-PROCESS QUEUES =====-----

 -----===== READY QUEUES =====-----

 -----===== SNOOZED QUEUES =====-----

 -----===== INACTIVE QUEUES =====-----

 -----===== RETIRED QUEUES =====-----

9. FrontierNonemptyReport: Usually dumped at the end of a crawl; contains information about the frontier queues that are not empty.

10. ToeThreadsReport: Report on the call stacks of the toe threads.

2. reportsDir: Directory where the reports are saved.

3. liveHostReportSize: Size limit for the report on hosts that are currently being crawled, that is, how many hosts the live hosts report includes.

4. intervalSeconds: Interval, in seconds, at which progress information is logged.

5. keepSnapshotsCount: Number of sample snapshots kept for computing statistics.
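
As a rough illustration of how the properties described above fit together, the following sketch uncomments some of them in the default bean and changes their values. The reduced report list and the values 60 and 10 are assumptions chosen only for this example, not recommended settings; the bean ids, class names, and property names are taken from the default configuration shown earlier.

```xml
<!-- Sketch only: the reduced report list and the values 60 and 10 are illustrative assumptions. -->
<bean id="statisticsTracker" class="org.archive.crawler.reporting.StatisticsTracker" autowire="byName">
  <!-- Generate only three of the ten reports listed in the default configuration -->
  <property name="reports">
    <list>
      <bean id="crawlSummaryReport" class="org.archive.crawler.reporting.CrawlSummaryReport" />
      <bean id="seedsReport" class="org.archive.crawler.reporting.SeedsReport" />
      <bean id="hostsReport" class="org.archive.crawler.reporting.HostsReport" />
    </list>
  </property>
  <!-- Same directory as the commented-out default -->
  <property name="reportsDir" value="${launchId}/reports" />
  <!-- Log progress information every 60 seconds instead of the default 20 -->
  <property name="intervalSeconds" value="60" />
  <!-- Keep 10 sample snapshots instead of the default 5 -->
  <property name="keepSnapshotsCount" value="10" />
</bean>
```

Any property left commented out, such as liveHostReportSize here, should keep its default value.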



Posted by Righ