Sina Blog Article Migration and Hidden Article Extraction Operation Manual

This manual describes how to find Sina Blog articles that are hidden from the public and transfer them in bulk to a WordPress-based blog for publication.

1. Project Objectives

  • Full migra­tion: Migrate all Sina Blog arti­cles (includ­ing hid­den arti­cles) to Word­Press.
  • Iden­ti­fy the dif­fer­ences: Find posts that are vis­i­ble only to the blog­ger (not vis­i­ble to the pub­lic).
  • Batch import: Generate JSON/HTML files that can be imported into WordPress.

2. Required Environment

  1. Python 3.9+ (on Windows, installing the latest version of Python is recommended)
  2. Install the required libraries: pip install requests beautifulsoup4 lxml

3. Scripts Used

(1) claw_sina_blog.py

Function: Batch-crawls Sina Blog articles (public mode and login mode), saving the HTML and an index file, index.json.
New feature: Supports --cookie / --cookie-file, which allows crawling articles that are visible only to the blogger.

Com­mon para­me­ters:

  • --uid : Sina Blog UID (for example, 1484861452).
  • --outdir : Output directory.
  • --start / --end : First and last list pages to crawl.
  • --cookie-file : Read the Cookie from a file (a single line of text).
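
How a single-line cookie file might be turned into request headers can be sketched as follows. This is only an illustration, not the actual code of claw_sina_blog.py: the helper names load_cookie and build_headers and the User-Agent value are assumptions.

```python
from pathlib import Path


def load_cookie(path):
    """Read the cookie string from a single-line text file (hypothetical helper)."""
    # strip() removes the trailing newline that editors usually append.
    return Path(path).read_text(encoding="utf-8").strip()


def build_headers(cookie=None):
    """Build HTTP request headers; the Cookie header is added only in login mode."""
    headers = {"User-Agent": "Mozilla/5.0"}  # assumed value, not from the script
    if cookie:
        headers["Cookie"] = cookie
    return headers
```

The resulting dictionary could be passed as the headers argument of requests.get(); without a cookie the same code path serves public mode.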

(2) dedup_index.py

Function: Deduplicates index.json and generates a clean version, index_clean.json.
The original crawl may contain many duplicates (more than 3000 entries); after deduplication there should be about 387 articles.
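
The deduplication idea can be sketched as below. This is not the source of dedup_index.py; it assumes that index.json holds a JSON list of objects and that each article is identified by a "url" field, and both the field name and the function names are hypothetical.

```python
import json
from pathlib import Path


def dedup_entries(entries, key="url"):
    """Keep the first entry for each distinct key value, preserving order."""
    seen = set()
    unique = []
    for entry in entries:
        value = entry.get(key)
        if value is None:
            # Entries without the key cannot be compared, so keep them all.
            unique.append(entry)
            continue
        if value in seen:
            continue
        seen.add(value)
        unique.append(entry)
    return unique


def dedup_index_file(src, dst, key="url"):
    """Read src (a JSON list), deduplicate it, and write the result to dst."""
    entries = json.loads(Path(src).read_text(encoding="utf-8"))
    unique = dedup_entries(entries, key)
    Path(dst).write_text(
        json.dumps(unique, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(entries), len(unique)
```

Comparing the two returned counts gives a quick check that the cleaned index has shrunk to the expected size.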


(3) sina_hidden_finder.py

Func­tion:

  1. Compare the public-mode and login-mode indexes to find articles that only the blogger can see.
  2. Generate a manifest (JSON/CSV).
  3. Batch-download the HTML of the hidden articles and save it to downloads/.

Com­mon para­me­ters:

  • --public-index : Pub­lic mode index.json.
  • --owner-index : Login mode index_clean.json.
  • --cookie-file : Cookie file, used to crawl while logged in.
  • --outdir : Output directory.
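
The comparison and manifest steps amount to a set difference followed by two file writes. The sketch below is not the source of sina_hidden_finder.py; it assumes both indexes are JSON lists of objects with "url" and "title" fields, and the function names are hypothetical.

```python
import csv
import json
from pathlib import Path


def find_hidden(public_entries, owner_entries, key="url"):
    """Return owner-mode entries whose key does not appear in the public index."""
    public_keys = {e[key] for e in public_entries if key in e}
    return [e for e in owner_entries if key in e and e[key] not in public_keys]


def write_manifest(hidden, outdir, fields=("title", "url")):
    """Write the hidden-article list as hidden_only.json and hidden_only.csv."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "hidden_only.json").write_text(
        json.dumps(hidden, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    # utf-8-sig adds a BOM so spreadsheet tools display Chinese titles correctly.
    with open(out / "hidden_only.csv", "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=list(fields), extrasaction="ignore")
        writer.writeheader()
        writer.writerows(hidden)
```

Because the difference is taken against the deduplicated owner index, each hidden article appears in the manifest once.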

4. Operation Steps

Step 1. Public mode crawling

python .\claw_sina_blog.py --uid 1484861452 --outdir .\sina_public --start 1 --end 8

Generates:

.\sina_public\index.json

Step 2. Login mode crawling

  1. Copy the Cookie from the browser and save it as cookie.txt (a single line of text).
  2. Run:
python .\claw_sina_blog.py --uid 1484861452 --outdir .\sina_owner --start 1 --end 8 --cookie-file .\cookie.txt

Generates:

.\sina_owner\index.json

Step 3. Deduplication

python .\dedup_index.py

Generates:

.\sina_owner\index_clean.json

Step 4. Difference Extraction & Hidden Article Download

python .\sina_hidden_finder.py `
    --public-index .\sina_public\index.json `
    --owner-index .\sina_owner\index_clean.json `
    --cookie (Get-Content .\cookie.txt -Raw) `
    --outdir .\sina_hidden `
    --delay 1.0 `
    --jitter 0.3

Generates:

  • .\sina_hidden\hidden_only.json
  • .\sina_hidden\hidden_only.csv
  • .\sina_hidden\downloads\*.html
  • .\sina_hidden\hidden_index.json
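
The --delay and --jitter options in the Step 4 command pace the downloads. How the script interprets them is not stated here; the sketch below assumes that jitter is a symmetric random offset added to the base delay, and the function names are hypothetical.

```python
import random
import time


def polite_delay(delay=1.0, jitter=0.3):
    """Return a wait time within delay +/- jitter seconds, never negative."""
    return max(0.0, delay + random.uniform(-jitter, jitter))


def fetch_all(urls, fetch, delay=1.0, jitter=0.3, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            # No pause before the first request, a randomized pause afterwards.
            sleep(polite_delay(delay, jitter))
        results[url] = fetch(url)
    return results
```

Randomizing the pause avoids a perfectly regular request rhythm, which is gentler on the server than a fixed interval.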

5. Quick Lookup of Commonly Used Commands

Public Mode

python .\claw_sina_blog.py --uid 1484861452 --outdir .\sina_public --start 1 --end 8

Login Mode

python .\claw_sina_blog.py --uid 1484861452 --outdir .\sina_owner --start 1 --end 8 --cookie-file .\cookie.txt

Deduplication

python .\dedup_index.py

Extract hidden articles

python .\sina_hidden_finder.py --public-index .\sina_public\index.json --owner-index .\sina_owner\index_clean.json --cookie (Get-Content .\cookie.txt -Raw) --outdir .\sina_hidden

6. Frequently Asked Questions (FAQ)

  • Q: The script keeps paging past the last page?
    A: Add --end 8 to limit the number of pages, or use the modified version of crawl() that automatically detects duplicate pages.
  • Q: The Cookie has expired?
    A: Copy the complete Cookie from the browser again, overwrite cookie.txt, and run the script again.
  • Q: index.json contains too many entries (>3000)?
    A: This means there are duplicates; run dedup_index.py to remove them.
  • Q: How do I import into WordPress?
    A: Use the existing wp_batch_import_v4_1_3.py, passing in hidden_index.json or the full index_clean.json.
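
The duplicate-page detection mentioned in the first answer can be sketched as below. This is not the actual crawl() from claw_sina_blog.py; it assumes that requesting a page beyond the last one returns articles that were already seen, and the callback fetch_page_urls is hypothetical.

```python
def crawl_pages(fetch_page_urls, start=1, end=None):
    """Collect article URLs page by page, stopping when a page adds nothing new.

    fetch_page_urls(page) is a hypothetical callback that returns the list
    of article URLs found on the given list page.
    """
    seen = set()
    collected = []
    page = start
    while end is None or page <= end:
        new = []
        for url in fetch_page_urls(page):
            if url not in seen:
                seen.add(url)
                new.append(url)
        if not new:
            # The page was empty or repeated earlier pages, so paging is done.
            break
        collected.extend(new)
        page += 1
    return collected
```

With this stop condition the --end limit becomes a safety cap rather than a value that must be known in advance.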

[Attach­ment: List of blog posts blocked by Sina Blog]

The Best Pic­ture Books Pub­lished in the Last Two Months (May-July 2008) 2008/7/13 10:14
[Book Excerpt] The World is Made for Good Peo­ple: Afan­ti’s Life and Child­hood (3) 2008/7/24 17:06
[Book Excerpt] The World is Made for Good Peo­ple: Afan­ti’s Life and Child­hood (2) 2008/7/24 17:08
A pic­ture book you can read while hold­ing it and recit­ing it: “Su Wu Shep­herd­ing Sheep” 2008/8/30 23:29
[Olympic Side Sto­ry] Egg pan­cakes, pirat­ed discs, and books reap­pear on Bei­jing streets… 2008/9/22 20:16
“How to Play with Pic­ture Books” Lec­ture Tran­script (Part 2) 2008/10/17 9:30
“How to Play with Pic­ture Books” Lec­ture Tran­script (Part 2) 2008/10/17 9:35
[Notes] “How can par­ents help their chil­dren fall in love with writ­ing?” 2009/2/2 14:44
A‑Jia, Car­rot Detec­tive’s 2009 Sum­mer Book Rec­om­men­da­tions (Part 2) 2009/7/16 21:45
Random Notes: From Androgyny to Modern Witches to Lao Tzu’s Feminist Perspectives 2009/9/21 9:37
Con­tin­ue to dis­cuss par­ent­ing strate­gies with Yangyang’s dad: read­ing, ask­ing ques­tions, rewards, etc.… 2009/10/16 14:28
Con­tin­u­ing the Gos­sip Chap­ter of the Love Cul­ti­va­tion Chap­ter — A Sto­ry of a Moth­er Who Aban­doned Her Child… 2010/1/24 21:52
German Contemporary Children’s Book Illustration Exhibition: Exhibition Overview, Exhibits, and Illustrators 2010/3/31 11:34
Reading History Excerpts: A “Superstitious” Account of Natural Disasters in Han Chinese Stories 2010/4/17 23:16
Masters of the Art of Storytelling for Children (Part 5) 2010/5/27 22:13
Mas­ters of the Art of Sto­ry­telling for Chil­dren (Part 6) 2010/5/30 21:57
[Repost] Example of a personal application process for “Green Child” 2010/6/17 10:19
Wei­bo Chat: Watch­ing and Talk­ing About Movies with Kids 2010/12/20 10:01
Chat­ting with Chil­dren about “Sev­en-Char­ac­ter Verse: The Long March” 2010/12/30 15:41
[Man­darin and Can­tonese] “Five Hun­dred Words of Reflec­tion on the Jour­ney from Bei­jing to Fengx­i­an Coun­ty” Recita­tion and Pro­nun­ci­a­tion Demon­stra­tion 2011/1/2 23:42
[Read­ing Notes] Who actu­al­ly plant­ed the sparse fence in Du Fu’s “Pre­sent­ed to Wu Lang Again”? 2011/6/14 0:12
Noise and commotion? Colorful and harmonious… Reflections on “Sounds in the Park” 2011/6/23 23:24
[Repost] Ube Live Record­ing: Nao Mat­sui’s Favorite Pic­ture Books — Japan­ese Pic­ture Books 2011/8/28 21:40
A Brief Overview of the Devel­op­ment of Pic­ture Books World­wide (Based on Cur­rent­ly Pub­lished Eng­lish Pic­ture Books) 2011/9/21 23:01
[Post­script] The fas­ci­nat­ing qui­et, leisure­ly, and joy­ful life 2012/5/29 10:06
[Repost] Pre-registration Notice for the Fifth Red Mud Children’s Classic Book Study Session 2012/8/22 23:29
[Repost] Leo Lion­ni and his field mouse Alfred 2012/10/8 21:55
[Repost] Fairy Tale? Not a Fairy Tale? Introduction to the Red Mud Digital Platform Library 2012/10/12 9:25
[Repost] Chen Lu: The Social Network Structure of Children’s Reading Promotion in the United States 2012/12/19 9:15
[Repost] Rain or shine – Red Mud Osen Book Club 2013/7/2 11:16
[Repost] Per­fec­tion (Hip­po Reviews Loren­z’s Pop­u­lar Sci­ence Clas­sics Series, Vol­ume 5) 2013/12/16 10:45
[Repost] Registration Notice for the 7th Red Mud Children’s Classic Book Study Session 2013/12/28 11:22
Rec­om­mend a small pro­fes­sion­al library soft­ware (free use with­in 1000 titles) 2014/5/14 9:59
Index of names, titles, and key­words for Dear Genius (sort­ed by orig­i­nal Eng­lish title) 2014/6/15 22:30
Why is Dear Genius an impor­tant book? 2014/9/25 0:03
A Summer Journey Through American Children’s Literature 2014/10/29 18:28
Repost: Mag­ic Chil­dren’s Book Club’s Zhang Hong — Review of the Ten Secrets of Pic­ture Book Play­ing Lec­ture in Shang­hai 2015/3/26 11:40
Key Points from Lectures 1 and 2 of the “Stories of Geniuses” Series 2015/6/12 8:37
The 8th Seed Sto­ry­teller Train­ing Ses­sion 2: Inspec­tor Radish — The Secret Path to Oz 2016/4/21 16:42
[Trans­la­tion Notes] “The Boy and the Cher­ry Tree,” Dreams and Per­se­ver­ance… 2016/5/20 10:54
Pic­ture Book Cre­ation Inter­view: Lis­ten to Teacher Cai Gao Talk About “Meng Jiangnu Weep­ing at the Great Wall” 2016/5/21 14:51
Audio Col­umn “Ajia Sto­ry­telling” Launch Notes 2016/12/15 21:02
How did the Chi­nese-fla­vored “Princess’s Kite” come about? 2017/1/2 16:44
[Repost] Bar­ry Moser is a famous artist 2017/3/30 18:41
[Notes] Peter Rab­bit’s Char­ac­ter Design and Devel­op­ment (Part 2) 2018/1/9 10:05
[Notes] Peter Rab­bit’s Char­ac­ter Design and Devel­op­ment (Part 3) 2018/1/11 10:42
A Brief Overview of the Devel­op­ment of Pic­ture Books World­wide (Based on Cur­rent­ly Pub­lished Eng­lish Pic­ture Books) 2019/5/22 20:59
