telco dnlg:inventory,parser

A phone company (or telecom operator) with millions of customers needs to maintain that many “circuit descriptors” in some database. I'll call them circuit descriptor databases — cdd — but Verizon calls them INVENTORIES.

In Verizon, a complete circuit descriptor is known as a DESIGN or a LINE RECORD, to be PARSEd.

Databases can be relational or non-relational. In Verizon, most if not all of these disparate databases (NSDB, iview, Virtual Inventory ..) are accessed through their FT (flow through?). An FT is the proxy sitting in front of a database. Any system querying the cdd must go through the FT, load the line record and parse it.

circuit descriptors are stored in different INVENTORIES according to region/state and “service”, such as FTTP, Layer-1 …
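The access pattern above (route to the right inventory by region/service, go through its FT, load the line record, parse it) could be sketched roughly as follows. This is a toy illustration, not the real FT API: all class names, field layouts, and the `fetch_line_record`/`parse` functions are invented.

```python
# Hypothetical sketch of the FT-proxy access pattern; names are invented,
# the real FT interfaces are Verizon-internal.

class LineRecord:
    """A raw circuit descriptor (DESIGN) as returned by an inventory's FT."""
    def __init__(self, raw: str):
        self.raw = raw

class FtProxy:
    """Stands in front of one inventory (e.g. NSDB); all queries go through it."""
    def __init__(self, inventory_name: str, records: dict):
        self.inventory_name = inventory_name
        self._records = records  # stand-in for the real database behind the FT

    def fetch_line_record(self, circuit_id: str) -> LineRecord:
        return LineRecord(self._records[circuit_id])

def parse(record: LineRecord) -> dict:
    # Toy parser: real line records are far richer than key=value pairs.
    fields = record.raw.split(";")
    return dict(f.split("=", 1) for f in fields)

# Routing by region/service, as described: pick the right inventory first.
proxies = {
    ("NY", "FTTP"): FtProxy("NSDB", {"CKT-1": "id=CKT-1;svc=FTTP;state=NY"}),
}

rec = proxies[("NY", "FTTP")].fetch_line_record("CKT-1")
print(parse(rec))  # {'id': 'CKT-1', 'svc': 'FTTP', 'state': 'NY'}
```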

telco dnlg:expert system maintenance

sscfi and other expert systems often make mistakes.

Q: is a mistake always easy to recognize by a human?

A: usually yes for sscfi

Q: example of an obvious mistake in sscfi?

A: wrong diagnosis

Q: what other mistakes?

A: slowness

Q: How do you reduce recurring mistakes?

A: reviews by human experts

Q: how many such reviews each month?

A: about 1/day

zed transaction volume – 50 tx/sec

Hi Tan,
Good to hear from you. We would have done close to 50 tps during the peak time.
Partha
—– Original Message —–
Sent: Wednesday, June 13, 2007 11:52 AM
Subject: interview question

> Hi Partha,
>
> How’s your wife and daughter?
>
> I’m preparing for some interviews. One interviewer asked “how many
> transactions per minute at your (zed’s) peak load”. I said more than
> 50/minute and he laughed.
>
> Any idea about the peak volume in 2002?
>
> Thanks
>
>
>

challenges ] %%FTTP task

ranked in terms of “usefulness” — level of detail, ease of grasp, ….
) de-couple — the design is by default tightly coupled

) quality assurance — of the model. How do we know for sure that the right classes were used, and hundreds of attributes are set exactly right? Complications
* runtime binding
* object behaviours determined at run time based on neighbours whose exact types are determined at runtime. Consider setAid()
=> recursive dumper
=> unit tests
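The recursive-dumper idea above can be sketched quickly: walk the object graph, render every attribute as indented text, and let unit tests assert on (or diff) the snapshot. This is a minimal illustration under assumed names; the real circuit model and its `setAid()`-style runtime binding are far larger.

```python
# Minimal sketch of a "recursive dumper" for model QA. Class names invented.

def dump(obj, seen=None, indent=0):
    """Recursively render an object's attributes as indented text."""
    seen = seen if seen is not None else set()
    pad = "  " * indent
    if id(obj) in seen:                  # guard against cycles (circuit graphs have them)
        return pad + f"<cycle:{type(obj).__name__}>\n"
    if not hasattr(obj, "__dict__"):     # leaf value: just repr it
        return pad + repr(obj) + "\n"
    seen.add(id(obj))
    out = pad + type(obj).__name__ + "\n"
    for name, value in sorted(vars(obj).items()):
        out += pad + f"  {name}:\n" + dump(value, seen, indent + 2)
    return out

class Element:
    def __init__(self, aid):
        self.aid = aid
        self.neighbour = None

a, b = Element("A1"), Element("B2")
a.neighbour = b
b.neighbour = a          # cyclic, like adjacent circuit elements
snapshot = dump(a)
# A unit test can now assert on the snapshot, or diff it against a golden file.
```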

) /tacit-knowledge/ instead of documentation
) other developers are mostly “offsite”

) error handling — not so lenient as to return unusable objects, yet not so drastic as to crash the system
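That middle ground might look like the sketch below: tolerate recoverable field errors (note them and keep going), but refuse to return an object missing its essentials. The field names and `UnusableRecordError` are hypothetical.

```python
# Sketch of "lenient on recoverable errors, strict on unusable objects".
# Field names are hypothetical.

class UnusableRecordError(Exception):
    pass

ESSENTIAL = {"circuit_id"}

def parse_fields(raw: str) -> dict:
    parsed, problems = {}, []
    for chunk in raw.split(";"):
        try:
            key, value = chunk.split("=", 1)
            parsed[key] = value
        except ValueError:
            problems.append(chunk)        # lenient: note the bad chunk, keep going
    missing = ESSENTIAL - parsed.keys()
    if missing:
        # strict: an object without its essentials is worse than no object
        raise UnusableRecordError(f"missing {missing}; bad chunks: {problems}")
    return parsed

print(parse_fields("circuit_id=CKT-9;garbage;speed=45M"))
# {'circuit_id': 'CKT-9', 'speed': '45M'}
```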

) multiple-inheritance — an acknowledged necessary evil when converting from LISP
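For a feel of why multiple inheritance is hard to avoid when a class genuinely has two facets, here is a minimal Python sketch (Python's method resolution order is close in spirit to CLOS class precedence). Class names are invented, not from the real model.

```python
# Illustrative-only multiple inheritance: one class, two independent facets.

class Monitorable:
    def status(self):
        return "monitored"

class Testable:
    def run_test(self):
        return "test ok"

class CircuitElement(Monitorable, Testable):
    """Inherits both facets; one single-inheritance taxonomy won't fit."""
    pass

e = CircuitElement()
print(e.status(), e.run_test())   # monitored test ok
print([c.__name__ for c in CircuitElement.__mro__])
# ['CircuitElement', 'Monitorable', 'Testable', 'object']
```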

) testing is tedious
* doesn’t allow “micro cycles”. When you want to add a debugging statement or make any minor change, you usually must shut down both server and swing client, edit, recompile, (perhaps re-deploy,) restart server and swing client, go through the client dialog to specify your input, and wait a few seconds to see the result.
* Worse still, you may need to run a corba agent to interface with a remote corba server

) mem leak, out-of-mem in load test

) perf — Many design considerations
=> element filters
=> pre-fabricate common element objects before trying out templates

) thread safety — open call …
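The "open call" idiom mentioned above: release the lock before calling out to code you don't control, so a slow or re-entrant callback can't stall other threads or deadlock. A minimal sketch, with invented names:

```python
# Open-call sketch: short critical section, callbacks invoked lock-free.

import threading

class ElementRegistry:
    def __init__(self):
        self._lock = threading.Lock()
        self._elements = {}
        self._listeners = []

    def subscribe(self, fn):
        with self._lock:
            self._listeners.append(fn)

    def add(self, aid, element):
        with self._lock:                        # short critical section
            self._elements[aid] = element
            listeners = list(self._listeners)   # snapshot under the lock
        for fn in listeners:                    # open call: lock NOT held here
            fn(aid, element)

events = []
reg = ElementRegistry()
reg.subscribe(lambda aid, el: events.append(aid))
reg.add("CKT-1", object())
print(events)  # ['CKT-1']
```

Holding the lock across the callback would risk deadlock the moment a listener calls back into the registry.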

—- hard to lj, or otherwise not useful:
) validation — non-standard line records abound, a well-known issue. The input (line record) admits a large number of possible errors, much like general error checking
) AI to choose the right template

NextGen server mean time to failure?

Hi,

Just curious — in production, how often would we need to restart NextGen server? In other words, what’s the mean time to failure, and what’s the service level agreement? I’m not advocating any change. I understand our approach is perhaps the only option. I just want to know the expectation on NextGen server.

My perspective — my previous teams (%% mmail%%) created client-server systems in-house where a home-made daemon would run inside a Unix host for a few days before it eventually stopped working, for various reasons:

* request queue grows and grows until it overflows
* no more output in the server log
* extremely slow response
* excessive network latency
* perhaps memory leak
* perhaps deadlocks
* perhaps some threads get stuck due to excessive synchronization and locking
* perhaps thrashing
* core dump

Such degradation symptoms are common in Windows desktops — most of us do a reboot at least once a week or so. Even commercial-grade servers can suffer the same fate. In a high-volume system, we used to restart our iplanet servers every night. Robust and resilient server design is a software industry challenge.

tan bin

10 sound-bytes for sscfi

* complete name: “special service circuit fault isolation”
* tag line “expert system for AUTONOMOUS fault isolation in comms circuits”
* rule-based and model-based
* “sscfi is a model-based expert system. It reads the target circuit’s design to generate an internal circuit model; ….” (Circuit model is my responsibility.)
* object-oriented LISP, C++, Corba, Perl, …
* in service since 1991
* used to run on multiple RS6000/AIX workstations
* how many circuit diagnoses a year? hundreds of thousands.